Erratum: Synthesis of Asynchronous VLSI Circuits

Alain  J.  Martin 

Department  of  Computer  Science 
California  Institute  of  Technology 
Pasadena  CA  91125,  USA 

March  22,  2000 


This  document  is  old  (1991)  and  in  several  respects  doesn’t  describe 
Caltech’s  current  approach  to  asynchronous  VLSI  design,  especially 
concerning  design  for  high  throughput.  However,  everything  in  the 
document  is  valid  and  relevant  to  today’s  design,  except  for  one  error. 

On  pp.  38  through  41,  arbiters  and  synchronizers  are  described  and 
a  circuit  implementation  is  given  for  the  arbiter.  This  implementation, 
which has been used successfully in many circuits, is correct, and, as
far  as  I  know,  it  is  the  best  implementation  of  an  arbiter. 

However, the document suggests that the synchronizer can be implemented
in the same way as the arbiter, although it does not actually
give  such  an  implementation.  An  implementation  of  the  synchronizer 
similar  to  the  one  of  the  arbiter  would  be  incorrect,  as  the  circuit  could 
deadlock  in  some  pathological  cases. 

We are currently writing a paper describing a correct implementation
of the synchronizer. Anybody interested in getting a copy should
send an email to alain@cs.caltech.edu.

Chapter 1

Introduction 


Delays  have  dangerous  ends. 
William  Shakespeare 


With chip size reaching one million transistors, the complexity of VLSI
algorithms (i.e., algorithms implemented as a digital VLSI circuit) is
approaching that of software algorithms (i.e., algorithms implemented as code
for a stored-program computer). Yet design methods for VLSI algorithms lag
far behind the potential of the technology.

Since  a  digital  circuit  is  the  implementation  of  a  concurrent  algorithm, 
we  propose  a  concurrent  programming  approach  to  digital  VLSI  design.  The 
circuit to be designed is first implemented as a concurrent program that
fulfills the logical specification of the circuit. The program is then compiled,
manually or automatically, into a circuit by applying semantics-preserving
program transformations. Hence, the circuit obtained is correct by
construction.

The  main  obstacle  to  such  a  method  is  finding  an  interface  that  provides  a 
good  separation  of  the  physical  and  algorithmic  concerns.  Among  the  physical 
parameters  of  the  implementation,  timing  is  the  most  difficult  to  isolate  from 
the logical design, because the timing properties of a circuit are essential not
only to its real-time behavior, but also to its logical correctness if the usual
synchronous  techniques  are  used  to  implement  sequencing. 

For  this  reason,  delay-insensitive  techniques  are  particularly  attractive  for 
VLSI  synthesis.  A  circuit  is  delay-insensitive  when  its  correct  operation  is 
independent  of  any  assumption  on  delays  in  operators  and  wires  except  that 
the delays be finite[21]. Such circuits do not use a clock signal or knowledge
about  delays. 

Let  us  clarify  a  matter  of  definitions  right  away:  It  has  been  proved  in 
[14] that the class of entirely delay-insensitive circuits is very limited.
Different asynchronous techniques distinguish themselves in the choice of the
compromises to delay-insensitivity.

Speed-independent  techniques  assume  that  delays  in  gates  are  arbitrary, 
but there are no delays in wires[17]. Self-timed techniques assume that a
circuit can be decomposed into equipotential regions inside which wire delays are
negligible[20]. In our method, certain local ‘forks’ are introduced to distribute
a  variable  as  inputs  of  several  operators.  We  assume  that  the  differences  in 
delays  between  the  branches  of  the  fork  are  shorter  than  the  delays  in  the 
operators  to  which  the  fork  is  an  input.  We  call  such  forks  isochronic. 

Although  we  initially  chose  delay-insensitive  techniques  for  reasons  of 
methodology,  those  techniques  present  other  important  advantages  in  terms 
of  efficiency  and  robustness: 

•  The  clock  rate  of  a  synchronous  design  has  to  be  slowed  to  account 
for  the  worst-case  clock  skews  in  the  circuit,  and  for  the  slowest  step  in  a 
sequence  of  actions.  Since  delay-insensitive  circuits  do  not  use  clocks,  they 
are  potentially  faster  than  their  synchronous  equivalent. 

• Since the logical correctness of the circuits is independent of the
values of the physical parameters, delay-insensitive circuits are very robust to
variations of these parameters caused by scaling or fabrication, or by some
non-deterministic behavior such as the metastability of arbiters. For instance,
all the chips we have designed have been found to be functional in a range of
voltage values (for the constant voltage level encoding the high logical value)
from above 10V to below 1V.

•  Delay-insensitive  circuit  design  can  be  modular:  A  part  of  a  circuit  can 
be  replaced  by  a  logically  equivalent  one  and  safely  incorporated  into  the 
design  without  changes  of  interfaces. 

•  Because  an  operator  of  a  delay-insensitive  circuit  is  “fired”  only  when  its 
firing  contributes  to  the  next  step  of  the  computation,  the  power  consumption 
of such a circuit can be much lower than that of its synchronous equivalent.

•  Since  the  correctness  of  the  circuits  is  independent  of  propagation  delays 
in  wires  and,  thus,  of  the  length  of  the  wires,  the  layout  of  chips  is  facilitated. 

The method indeed produces correct and efficient circuits. It has been
applied, both with “hand compilation” and automatic compilation, to a series
of difficult design problems, such as distributed mutual exclusion, fair
arbitration, routing automata, a stack, and a serial multiplier. All fabricated chips
have been found to be correct on “first silicon”. Although our CMOS
implementation of the basic operators has been overly cautious, and the electrical
optimization techniques have been rather tame, the performance of the chips
has been found to be at least equal to that of synchronous implementations.
We have just completed the design of a general-purpose microprocessor, and
its performance is very encouraging: in 1.6 µm SCMOS, it runs at 18 million
instructions per second. (See later for more detail.)

The  main  reason  for  the  efficiency  of  the  method  is  that,  rather  than 
going  in  one  step  from  program  to  circuit,  the  designer  applies  a  series  of 
transformations  to  the  original  program.  At  each  stage,  powerful  algebraic 
manipulations  can  be  performed  leading  to  important  optimizations  in  terms 
of  speed  or  area. 

The  most  encouraging  aspect  of  the  method  is  that  it  is  really  a  synthesis 
technique:  it  allows  a  designer  to  construct  solutions  that  he  would  never  have 
found had he not applied the method. We shall observe that different
applications of the transformations lead to different circuits for the same program.
Although  all  circuits  are  semantically  equivalent,  they  may  exhibit  different 
behaviors  in  terms  of  speed  or  size  (number  of  operators  used).  The  method 
therefore  includes  the  trade-offs  between  simplicity  and  efficiency  that  should 
be  available  to  the  VLSI  designer. 

Using concurrency to implement a sequential computation may seem
wasteful at first sight. But VLSI is essentially a concurrent medium: concurrency is
implemented at no cost by mere juxtaposition of the concurrent parts. On the
other hand, implementing sequencing requires synchronization and is, in
general, more expensive. We shall therefore implement sequencing as restricted
concurrency. Once a process has been transformed into a semantically
equivalent set, the problem of implementing sequencing has disappeared!

This  technique  entails  one  of  the  main  novelties  of  the  method.  Other 
techniques  implement  sequencing  by  transforming  the  computation  into  a 
finite-state  machine,  and  realizing  each  state  with  a  state-holding  element.  In 
our  technique,  some  state-holding  elements  may  be  needed:  we  shall  see  that 
in  the  transformation  from  sequences  to  PR  set  we  may  have  to  introduce 
so-called  state  variables  which  correspond  to  state-holding  elements.  But  the 
number of those elements is drastically smaller than in techniques using
finite-state machines.

We first introduce the “source code” notation, called Communicating
Hardware Processes, or CHP, which is a concurrent programming notation inspired
by C.A.R. Hoare’s CSP[5]: A program is a set of concurrent processes
communicating by input and output commands on channels. Second, we describe the
object code notation, called production rule set, which is an entirely
concurrent programming paradigm: All enabled commands can be fired concurrently
at any time. This notation is one of the main innovations of the method and
is an interesting notation for digital VLSI all by itself.

Next, we describe the four main steps of the compilation (process
decomposition, handshaking expansion, production rule expansion, operator
reduction) and illustrate them with a number of examples. In particular, we present
the  different  algebraic  transformations  that  can  be  applied  at  different  stages 
of  the  compilation,  and  which  give  the  method  its  flexibility  and  efficiency. 


Chapter  2 

Communicating  Hardware 
Processes 


The source notation is a program notation and not a hardware description
language. It is inspired by C.A.R. Hoare’s CSP[5] and E.W. Dijkstra’s guarded
commands[3], and is based on assignment and process communication by
message-passing.


2.1  Data  Types  and  Assignment 

The only basic data type is the boolean. The other types, integer and floating
point, are collections of booleans that we represent in the PASCAL record
notation.

For b boolean, the command b := true, also denoted b↑, is the assignment
of the value true to b. Similarly, the command b := false, also denoted b↓,
is the assignment of the value false to b.

An integer of “length” n is a predefined record type consisting of n boolean
components (“fields” in the PASCAL jargon). For instance, if x is declared
as an integer of length 8, then x is a collection of the 8 boolean variables: x.0,
x.1, x.2, ..., x.7.

The  existence  of  this  predefined  record  type  for  integers  does  not  preclude 
the  programmer  from  introducing  other  records  to  structure  the  data.  For 
instance,  in  the  program  of  the  microprocessor,  which  we  will  introduce  later, 
the  integer  variable  i  represents  (contains)  the  currently  executed  instruction. 
This  value  is  declared  as  a  record  of  several  types  depending  on  the  type  of 
the instruction. For ALU instructions and ordinary memory instructions, the
type is:

alu = record
    op : alu.15..alu.12
    x : alu.11..alu.8
    y : alu.7..alu.4
    z : alu.3..alu.0
end,

where  the  field  op  contains  the  “opcode”  of  the  instruction,  the  fields  x 
and  y  contain  the  indices  of  the  registers  to  be  used  as  parameters  of  the 
instruction,  and  z  contains  the  index  of  the  register  in  which  the  result  of  the 
instruction  execution  is  to  be  stored. 

Since  operations  on  boolean  variables  are  the  only  primitive  operations, 
any  operation  on  other  data  types  appearing  in  a  program  must  be  understood 
to  be  a  shorthand  notation  or  function  call  for  the  sequence  of  operations  on 
boolean  variables  that  will  implement  it. 

For instance, given two integers x and y of the same length n, the
assignment

y := x

is a shorthand notation for the multiple assignment

y.0, y.1, ..., y.(n-1) := x.0, x.1, ..., x.(n-1).

The multiple assignment of n expressions to n variables is different from
the concurrent composition (which we will introduce shortly) of the n
elementary assignments. In the multiple assignment, the n expressions are all
evaluated before the results are assigned to the corresponding variables.
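
The difference can be illustrated with a minimal Python sketch. The two-variable swap is a hypothetical example, not taken from the text. Python's tuple assignment evaluates every right-hand side before storing anything, which matches the multiple assignment; executing the two elementary assignments one after the other (one possible interleaving of their concurrent composition) gives a different result.

```python
# Multiple assignment x, y := y, x: both right-hand sides are evaluated
# first, then the results are stored (Python tuple assignment does this).
def multiple_assignment(x, y):
    x, y = y, x
    return x, y

# One possible interleaving of the elementary assignments x := y and y := x:
# the second assignment already sees the value written by the first.
def interleaved_assignments(x, y):
    x = y
    y = x
    return x, y

print(multiple_assignment(True, False))      # (False, True)
print(interleaved_assignments(True, False))  # (False, False)
```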

For  the  sake  of  clarity,  we  will  use  the  usual  integer  arithmetic  operators 
(for  instance,  y  :=  x  +  1  in  the  program  of  the  microprocessor)  in  the  first 
description of an algorithm. However, since these operators are not primitive
constructs of the language, they are subsequently replaced with calls to
functions  that  implement  the  operators  in  terms  of  boolean  operations. 


2.2  Arrays 

The  array  mechanism  is  an  address-calculation  mechanism,  and  is  used  when 
the  identity  of  the  element  in  a  set  of  variables  that  is  to  be  used  for  some 
action will be determined during the computation. For example, the processor
uses three arrays: the instruction memory array, imem, whose index is
the program counter, pc; the data memory array, dmem; and the array of
general-purpose registers, reg. Hence, the execution of a load instruction, i,
is described by the assignment:

reg[i.z] := dmem[reg[i.x] + reg[i.y]].


In this example, reg[i.z] represents the register whose location (address in
the array) is the current value of the field z of the current instruction i, and
similarly for reg[i.x] and reg[i.y]. The assignment assigns to reg[i.z] the value
of the element in the array dmem at a location which is the sum of the contents
of registers reg[i.x] and reg[i.y].
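
As an illustration only, the load can be mimicked in Python. The sketch assumes that field alu.k is bit k of a 16-bit instruction word with bit 0 least significant, that there are 16 registers, and that the opcode value and memory contents are placeholders; none of these details is stated in the text.

```python
def field(word, hi, lo):
    """Return bits hi..lo (inclusive) of word as an integer."""
    width = hi - lo + 1
    return (word >> lo) & ((1 << width) - 1)

def decode_alu(i):
    # Field layout from the alu record: op 15..12, x 11..8, y 7..4, z 3..0.
    return {"op": field(i, 15, 12), "x": field(i, 11, 8),
            "y": field(i, 7, 4), "z": field(i, 3, 0)}

def execute_load(i, reg, dmem):
    # reg[i.z] := dmem[reg[i.x] + reg[i.y]]
    f = decode_alu(i)
    reg[f["z"]] = dmem[reg[f["x"]] + reg[f["y"]]]

reg = [0] * 16                  # assumed register count
dmem = list(range(100, 164))    # hypothetical memory contents
reg[1], reg[2] = 3, 4
instr = (0x0 << 12) | (1 << 8) | (2 << 4) | 5   # op=0 (placeholder), x=1, y=2, z=5
execute_load(instr, reg, dmem)
print(reg[5])   # dmem[3 + 4] = 107
```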


2.3  Composition  Operators 

There are three composition operators (also called “constructors”): the
sequential operator, represented by the semicolon; the concurrent, or parallel,
operator represented by the parallel bar, ||; and the coincident operator,
represented by the bullet.

The semantics of the sequential composition S1; S2 are well known: “First
execute S1 and then execute S2.” The semicolon is associative, but of course
not commutative.

We will assume that the semantics of the parallel composition are also well
known, although we are aware of how difficult it is to define these semantics
formally and simply: S1 || S2 denotes the parallel, or concurrent, execution
of S1 and S2.

We postulate that the parallel composition is weakly fair: If at a certain
point of the computation of S1 || S2, x is the next atomic action of S1, then
x will be executed after a finite number of atomic actions of S2.

Parallel  composition  is  associative  and  commutative. 

The bullet operator is used solely to compose communication commands.
(Communication commands will be introduced later.) Furthermore, the
coincident composition of two communication commands is defined only if the
two commands are non-interfering: Two programs are non-interfering if a
variable modified by one program is not used by the other program.

For S1 and S2 non-interfering communication commands, if the executions
of both S1 and S2 in a certain state of the computation terminate, then the
execution of S1 • S2 in that state terminates. Furthermore, the completion of
S1 coincides with the completion of S2; i.e., S1 and S2 are completed in the
same state of the computation. (We will return to this definition later when
we define the notion of completion of a non-atomic action.)

The  bullet  operator  is  associative  and  commutative. 

If S1 and S2 are non-interfering communication commands, the execution
of S1 || S2 is equivalent to the execution of either S1; S2 or S2; S1 or S1 • S2.

The bullet has the highest priority, followed by the semicolon, followed by
the parallel bar:

S0 • S1; S2 || S3 = ((S0 • S1); S2) || S3.


2.4 Control Structures

The  two  control  structures  are  the  selection  and  the  repetition  of  Dijkstra’s 
guarded commands. However, the VLSI programmer and the software
programmer adopt opposite attitudes towards non-determinism. Whereas the
latter is encouraged to maximize non-determinism as a way to avoid
unnecessary choices, the former is requested to minimize non-determinism to reduce
the high cost of arbitration in a direct VLSI implementation of a set of guarded
commands.

It  is  very  difficult,  if  at  all  possible,  to  determine  at  “compile-time”  which 
selections require arbitration. We therefore introduce two sets of control
structures, a deterministic set and a non-deterministic set, and let the programmer
explicitly  indicate  where  arbitration  is  needed. 

2.4.1  Selection 

The  execution  of  the  deterministic  selection  command 

[G1 → S1 [] ... [] Gn → Sn],

where G1 through Gn are boolean expressions and S1 through Sn are program
parts (Gi is called a “guard,” and Gi → Si is called a “guarded command”),
amounts to the execution of the arbitrary Si for which Gi holds. At any time
at most one guard holds. If none of the guards is true, the execution of the
command is suspended until one guard is true.

The  non-deterministic  selection  command 

[G1 → S1 | ... | Gn → Sn]

is  identical  to  the  previous  one,  except  that  several  guards  may  be  true  at 
the  same  time.  In  such  a  case,  an  arbitrary  true  guard  is  selected. 

2.4.2  Repetition 

The  execution  of  the  deterministic  repetition  command 

*[G1 → S1 [] ... [] Gn → Sn],

where G1 through Gn are boolean expressions, and S1 through Sn are program
parts, amounts to repeatedly selecting the arbitrary Si for which Gi holds,
and executing Si. At any time, at most one guard holds. If none of the guards
is true, the repetition terminates.

The  non-deterministic  repetition  command 


*[G1 → S1 | ... | Gn → Sn]

is identical to the previous one, except that several guards may be true at
the same time. In such a case, an arbitrary true guard is selected.

[G], where G is a boolean expression, stands for [G → skip], and thus for
“wait until G holds.” (Hence, “[G]; S” and [G → S] are equivalent.)

*[S] stands for *[true → S] and, thus, for “repeat S forever.”
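
The deterministic repetition can be sketched as a small Python interpreter. This is only an illustration under stated assumptions: guards and commands are modelled as hypothetical Python functions on a state dictionary, and the greatest-common-divisor program is the classic guarded-command example, not one from this text. A selection that suspends until a guard becomes true cannot be shown in a sequential sketch, so only the terminating repetition is modelled.

```python
def select(guarded_commands, state):
    """Return the command whose guard holds in state, or None."""
    enabled = [cmd for guard, cmd in guarded_commands if guard(state)]
    # Deterministic form: the programmer guarantees at most one true guard.
    assert len(enabled) <= 1, "more than one guard holds"
    return enabled[0] if enabled else None

def repeat(guarded_commands, state):
    """Deterministic repetition: loop until no guard holds."""
    while True:
        cmd = select(guarded_commands, state)
        if cmd is None:
            return state
        cmd(state)

def sub_y_from_x(s):
    s["x"] -= s["y"]

def sub_x_from_y(s):
    s["y"] -= s["x"]

# Two mutually exclusive guards: x > y and y > x.
gcd_program = [
    (lambda s: s["x"] > s["y"], sub_y_from_x),
    (lambda s: s["y"] > s["x"], sub_x_from_y),
]
print(repeat(gcd_program, {"x": 12, "y": 18}))  # {'x': 6, 'y': 6}
```

The assertion in select makes the programmer's obligation explicit: in the deterministic form, two simultaneously true guards are an error rather than a case for arbitration.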

2.4.3  Reactive  Process  Structure 

From  the  preceding  definitions,  the  operational  description  of  the  statement 

*[[G1 → S1 | ... | Gn → Sn]]

is “repeat forever: Wait until some Gi holds; execute an Si for which Gi
holds.”  This  structure,  which  we  call  “reactive,”  is  used  very  frequently.  For 
instance,  the  server  processes  in  the  distributed  mutual  exclusion  example  are 
reactive  processes. 


2.5  The  Replication  Construct 

Both  because  of  the  restriction  of  basic  operations  to  booleans  and  because 
of  the  high  degree  of  concurrency  of  VLSI  algorithms,  such  algorithms  are 
characterized  by  an  extensive  use  of  replication.  A  typical  example  is  that 
some  action  has  to  be  performed  (sequentially  or  concurrently)  on  all  the 
boolean variables that represent an integer. Another example is that of an
n-place buffer constructed as the concurrent composition of n identical one-place
buffers.

The  notation  therefore  contains  a  syntactic  operator,  called  the  replication 
construct,  which  makes  it  possible  to  “clone”  any  program  part  into  a  number 
of  instances. 

The replication mechanism is used to represent a fixed, finite, and
non-empty list of syntactic objects. Operationally, we can say that the replication
mechanism  is  used  to  generate  a  list  of  objects  at  compile-time.  An  element 
of  the  list  is  any  program  part.  The  concatenation  operator  of  the  list  is  any 
constructor  or  separator  of  the  language.  The  constructors  are  the  semicolon 
for  sequential  composition,  and  the  comma  and  the  parallel  bar  for  parallel 
composition.  The  separators  are  the  bar  for  guarded  commands,  and  the 
blank  and  the  comma  for  lists  of  declarations. 

Recursion  is  the  basic  mechanism  for  creating  such  a  list.  Since  it  is 
often  convenient  to  “unroll”  the  simplest  form  of  tail  recursion  as  an  iteration 
mechanism,  both  iteration  and  recursion  are  available. 

The construct for replication by iteration is defined as follows: If

• op is any constructor or separator,

• i is an integer variable, called the running index,

• the range, defined by n..m, where n and m are integer constants, is not
empty, i.e., n ≤ m,

• S(i) is any program part in which i appears free,

then

⟨op i : n..m : S(i)⟩ = S(n),                            if n = m
⟨op i : n..m : S(i)⟩ = S(n) op ⟨op i : n+1..m : S(i)⟩,  if n < m

For n < m, the definition is ambiguous if op is not associative. In this
case the definition is taken to be equivalent to

S(n) op (⟨op i : n+1..m : S(i)⟩).

The bracket notation for replication is borrowed from Chandy and Misra[2],
who use it for defining so-called quantified expressions. Observe that a
replication command is not a quantified expression.

For example, the construct [⟨| i : 0..3 : G(i) → S(i)⟩] expands to

[G(0) → S(0)
|G(1) → S(1)
|G(2) → S(2)
|G(3) → S(3)
]


The  construct 

⟨; i : 0..2 : x.i := y.((i + 1) mod 3)⟩

expands to

x.0 := y.1; x.1 := y.2; x.2 := y.0.

Replication  constructs  can  be  nested  as  in  the  following  example: 

⟨, i : 0..9 : ⟨, j : 0..i : x(i,j) = 0⟩⟩.
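
The recursive definition of replication translates almost directly into code. The following Python sketch is an assumption-laden illustration: it expands a replication into plain text, with op given as a string and S as a hypothetical function from the running index to program text. It does not insert the parentheses that would be needed for a non-associative op.

```python
def replicate(op, n, m, S):
    """Expand the replication of S(i) over n..m with operator op into text."""
    if n > m:
        raise ValueError("the range n..m must not be empty")
    if n == m:
        return S(n)
    # S(n) op (replication over n+1..m), following the recursive definition.
    return f"{S(n)}{op} {replicate(op, n + 1, m, S)}"

print(replicate(";", 0, 2, lambda i: f"x.{i} := y.{(i + 1) % 3}"))
# x.0 := y.1; x.1 := y.2; x.2 := y.0
```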

2.6  Procedures  and  Functions 

Procedures  are  used  with  a  simple  parameter  mechanism:  A  parameter  is 
either  input  or  output.  For  procedure  p,  declared  as 

procedure p(x : input, y : output); S

the call p(a, b) is equivalent to the program part

x := a; S; b := y.


A parameter of a function is always an input parameter. For function y,
declared as

function y(x); S

where S is the same program part as in procedure p, a statement Q containing
the function call y(a) is equivalent to the program part

p(a, b); Q^b_{y(a)},

where b is a “fresh” variable, and Q^b_{y(a)} denotes Q with the call y(a)
replaced by b.

Tail  recursion  is  allowed  but  not  general  recursion,  since  general  recursion 
requires  the  construction,  at  execution  time,  of  a  stack  whose  size  may  vary 
with  the  parameters  of  the  computation. 


2.7  Concurrent  Processes 

The  main  building  block  for  the  construction  of  concurrent  computations  is 
the  process.  In  the  design  of  the  microprocessor  for  instance,  each  stage  of 
the  pipeline  is  a  process.  Concurrent  composition  of  processes  is  also  the 
main  source  of  concurrency,  although  we  allow  the  concurrent  composition  of 
statements  inside  processes.  In  strict  communicating-process  design  style,  a 
variable  is  local  to  a  process,  and  communication  among  processes  is  uniquely 
by way of message exchanges. In the design of the processor, we have
violated this rule and allowed processes to share variables in a restricted way: A
variable  of  one  process  may  be  inspected  by  another  process.  (Whether  this 
relaxation  of  the  locality  rule  is  a  useful  extension  or  a  weakness  of  the  flesh 
is  not  clear  at  the  moment.  More  experimentation  is  necessary.) 

Hence, the most common structure for the body of a concurrent
computation is the parallel construct:

p1 || p2 || ... || pn

where p1 through pn are the names of processes that have been declared
beforehand.  A  process  is  used  very  much  as  a  procedure  is  used:  It  is  first 
declared  in  a  declaration  statement  and  then  called  by  using  its  name  in 
a  statement.  Several  instances  of  the  same  process  type  can  be  called  by 
assigning different names. But, unlike procedures, each (instance of a) process
can  be  called  only  once. 

2.7.1  Communication  Commands,  Ports,  and  Channels 

Processes  communicate  with  each  other  by  using  communication  commands 
on ports. A port of a process is paired with a port of another process to
form a channel. For the time being, we assume that a channel is shared by
exactly  two  processes;  later,  we  will  generalize  the  definition  to  more  than  two 
processes.  For  instance,  the  microprocessor  uses  one-to-one,  one-to-many,  and 
many-to-many  (buses)  channels. 

A  process  is  either  elementary  or  composite.  The  ports  of  an  elementary 
process  are  external:  Each  is  to  be  connected  by  a  channel  to  a  port  of  another 
process  to  form  a  composite  process.  The  external  ports  of  a  process  are 
declared  in  the  heading  of  the  process,  like  the  parameters  of  a  procedure: 

p  =  process(JJ,i) 

(Later  on,  we  will  add  some  type  information  to  the  declaration.)  A  composite 
process,  p,  is  the  parallel  composition  of  several  processes.  The  ports  of 
a  component  process  that  are  connected  by  a  channel  to  ports  of  another 
component  process  are  internal  to  p.  The  ports  of  the  components  that  are 
left  unconnected  (dangling)  are  the  external  ports  of  p.  The  internal  ports 
and  the  channels  are  defined  by  channel  declaration  in  the  process  body. 

We  use  two  equivalent  naming  mechanisms  for  ports  and  channels.  The 
first  one  gives  local  names  to  ports  and  pairs  the  two  ports  of  a  channel.  For 
example, let two processes, p1 and p2, share a channel with port X in p1 and
port Y in p2. The declaration is as follows:

p1 = process(X) ... end
p2 = process(Y) ... end
p1 || p2
chan (p1.X, p2.Y)

The  second  mechanism  gives  global  names  to  channels,  and  uses  the  chan¬ 
nel  names  for  all  ports  of  the  same  channel.  For  instance,  the  same  two 
processes  would  be  described  as: 

p1 = process(C) ... end
p2 = process(C) ... end
p1 || p2
chan C

We  prefer  local  names  for  ports  when  the  processes  involved  are  identical 
(as  in  the  case  of  the  server  processes  in  the  distributed  mutual-exclusion 
example);  we  prefer  global  names  when  the  processes  are  different  because 
this  reduces  the  nomenclature.  (We  have  used  global  names  in  the  description 
of  the  processor.) 

If  the  channel  is  used  only  for  synchronization  between  the  processes,  the 
name  of  the  port  is  sufficient  for  identifying  a  communication  on  this  port. 
For  instance,  in  the  program  for  distributed  mutual  exclusion,  the  channel 
between a “master” process and its “server” process is identified with port D
in  the  master  and  port  U  in  the  server,  and  is  used  for  synchronization  only. 


2.7.2 Semantics of Synchronization

Since a message cannot be received before it has been sent, communication
actions on the two ports of the same channel have to be synchronized. The
weakest form of synchronization between the send actions on one port of a
channel and the receive actions on the other port of the same channel is that
at any moment the number cR of completed receive actions is at most equal
to the number cS of completed send actions:

cR ≤ cS

The  difference  cS  —  cR  is  the  number  of  messages  sent  that  have  not 
yet  been  received.  These  messages  have  to  be  buffered  somewhere  “in  the 
channel.”  Allowing  message  buffering  in  the  channels  obviously  implies  that 
channels be implemented as complex storage devices. In view of our
intention to use communication as an elementary sequencing and synchronization
mechanism, we want to opt for as simple an implementation of channels as
possible. Clearly, the simplest implementation is one in which no buffering
of messages is required. In turn, this choice implies that the
synchronization between send and receive actions on a channel be such that at any time
cR  =  cS.  Hence  the  following  definition  of  the  synchronization  property  of 
communication  primitives. 

If two processes, p1 and p2, share a channel with port X in p1 and port Y
in p2, then, at any time, the number of completed X-actions in p1 will equal
the number of completed Y-actions in p2; in other words, the completion of
the n-th X-action “coincides” with the completion of the n-th Y-action.

If, for example, p1 reaches the n-th X-action before p2 reaches the n-th
Y-action, the completion of X is suspended until p2 reaches Y. The X-action
is then said to be pending. When, thereafter, p2 reaches Y, both X and Y
are completed. The predicate “X is pending” is denoted as qX.

If, for an arbitrary command A, cA denotes the number of completed
A-actions, the semantics of a pair (X, Y) of communication commands is
expressed by the two axioms:

cX = cY          (A1)
¬qX ∨ ¬qY        (A2)
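
The two axioms can be checked on a small sequential model. The Python sketch below is an illustration under stated assumptions, not an implementation of the channel: the class name SyncChannel and the method reach are hypothetical, the two ports are called X and Y, and process arrivals are modelled as explicit method calls rather than as concurrent threads.

```python
class SyncChannel:
    """Sequential model of an unbuffered channel with ports X and Y."""

    def __init__(self):
        self.c = {"X": 0, "Y": 0}          # completed actions cX, cY
        self.q = {"X": False, "Y": False}  # pending flags qX, qY

    def reach(self, port):
        """The process owning `port` arrives at its next action on it."""
        assert not self.q[port], "this port already has a pending action"
        other = "Y" if port == "X" else "X"
        if self.q[other]:
            # The partner is waiting: both actions complete together.
            self.q[other] = False
            self.c["X"] += 1
            self.c["Y"] += 1
        else:
            # The partner has not arrived yet: this action is suspended.
            self.q[port] = True
        self.check()

    def check(self):
        assert self.c["X"] == self.c["Y"]          # cX = cY
        assert not (self.q["X"] and self.q["Y"])   # not qX, or not qY

ch = SyncChannel()
ch.reach("X")            # p1 arrives first: X is pending
print(ch.q["X"], ch.c)   # True {'X': 0, 'Y': 0}
ch.reach("Y")            # p2 arrives: both actions complete
print(ch.q["X"], ch.c)   # False {'X': 1, 'Y': 1}
```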


2.7.3  Probe 

Instead of the usual selection mechanism by which a set of pending
communication actions can be selected for execution, we provide a general boolean
command on channels, called the probe. The definition of the probe[6] states
that the probe command X̄ (the port name with an overbar) in process p1 has
the same value as qY, and, symmetrically, the probe command Ȳ in process p2
has the same value as qX.

Hence, in the guarded command X̄ → X, the X-action is not suspended
since qY holds as a precondition of X.

Remark: In view of our declared intention to implement processes in a
distributed and delay-insensitive way, our choice of definitions for
communication may already puzzle some readers: The definition of A1 relies on the
simultaneous completion of two actions in two different processes, and the
value of the probe in one process is supposed to be identical to a suspended
state of another process. A short explanation is that we have chosen
definitions of completion and suspension that are unorthodox but valid! □

2.7.4  Example 

Process  sel  repeatedly  performs  communication  action  X  or  communication 
action  Y,  whichever  can  be  completed;  sel  is  blocked  if  and  only  if  neither  X 
nor Y can be completed. The program body of sel is:

*[[X̅ → X | Y̅ → Y]]

Obviously, process sel is not fair, because of the non-deterministic choice of
a guard when both guards are true. Negated probes make it possible to
transform sel into a fair version, fsel, whose body is:

*[[X̅ → X; [Y̅ → Y [] ¬Y̅ → skip]
  | Y̅ → Y; [X̅ → X [] ¬X̅ → skip]
 ]].

This  example  illustrates  the  fact  that  negated  probes  are  necessary  for  imple¬ 
menting  fairness. 
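The difference between sel and fsel can be illustrated with a small sequential sketch. It is not a faithful model of concurrent processes: the functions `sel` and `fsel`, the callables that stand for the probes, and the `choose` function that stands for the arbiter are all hypothetical names introduced here, and a blocked process is modeled simply by stopping the loop.

```python
def sel(x_pending, y_pending, choose, rounds):
    """Sketch of *[[probe X -> X | probe Y -> Y]]: one communication per iteration."""
    probes = {"X": x_pending, "Y": y_pending}
    trace = []
    for _ in range(rounds):
        true_guards = [name for name in ("X", "Y") if probes[name]()]
        if not true_guards:          # the real process would be blocked here
            break
        trace.append(choose(true_guards))
    return trace


def fsel(x_pending, y_pending, choose, rounds):
    """Sketch of fsel: after serving one port, serve the other one if it is pending."""
    probes = {"X": x_pending, "Y": y_pending}
    trace = []
    for _ in range(rounds):
        true_guards = [name for name in ("X", "Y") if probes[name]()]
        if not true_guards:
            break
        first = choose(true_guards)
        trace.append(first)
        other = "Y" if first == "X" else "X"
        if probes[other]():          # inner selection; the negated probe gives skip
            trace.append(other)
    return trace


def always_pending():
    return True


def unfair_arbiter(true_guards):
    # Worst case for fairness: always pick the first true guard.
    return true_guards[0]
```

With both ports always pending and this worst-case arbiter, `sel` only ever serves X, whereas `fsel` alternates between X and Y.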

2.7.5  Communication 

Matching communication actions are also used to implement a form of
distributed assignment statement, to “pass messages,” as is often said. In that
case, the pair of commands is specified to consist of an input command and an
output command by adjoining to them the symbols “?” and “!”, respectively.
For example, X? is an input command and then X is an input port, and Y!
is an output command, and then Y is an output port.

Communication axiom. Let X?u and Y!v be matching, where u is a
process variable, and v is an expression of the same type as u. The
communication implements the assignment u := v. In other words, if v = V before the
communication, u = V and v = V after the communication.
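The synchronization property, the probe, and the communication axiom can be approximated in software by a rendezvous channel. The following is only a sketch under stated assumptions: the class `Channel` and its methods are hypothetical, it supports one sender and one receiver, it models only the probe on the receiving side, and it uses Python threads and queues, which is unrelated to the circuit implementation developed later in the text.

```python
import queue
import threading


class Channel:
    """Hypothetical slack-zero channel between one sender and one receiver."""

    def __init__(self):
        self._data = queue.Queue(maxsize=1)
        self._ack = queue.Queue(maxsize=1)

    def send(self, value):
        """Models Y!v: returns only after the matching receive has taken the value."""
        self._data.put(value)    # the send action is now pending
        self._ack.get()          # suspended until the receiver reaches its action

    def probe(self):
        """Models the probe on the receiving port: true when a send is pending."""
        return not self._data.empty()

    def receive(self):
        """Models X?u: the returned value is what gets assigned to u."""
        value = self._data.get()
        self._ack.put(None)      # lets the pending send complete
        return value


def demo():
    channel = Channel()

    def producer():
        for v in range(5):
            channel.send(v)

    worker = threading.Thread(target=producer)
    worker.start()
    received = [channel.receive() for _ in range(5)]
    worker.join()
    return received
```

Because `send` waits for an acknowledgment, the sender can never run ahead of the receiver by more than the action in progress, which is the software analogue of requiring cX = cY instead of buffering messages in the channel.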


2.8 Examples

In  this  section,  we  illustrate  the  notation  with  a  number  of  typical  exam¬ 
ples. The programs are given with a brief informal explanation. All proofs of
correctness are omitted.


2.8.1  Stream  Merge 

A  process  has  two  input  ports  X  and  Y,  and  an  output  port  Z.  The  process 
outputs  on  port  Z  a  stream  of  messages  which  is  an  arbitrary  merge  of  the 
stream  of  messages  received  on  X  and  the  stream  of  messages  received  on  Y. 
(The  type  of  the  messages  is  irrelevant.  For  the  completeness  of  the  declara¬ 
tions,  let  us  assume  they  are  integer  of  size  8.)  The  streams  received  on  X  and 
Y  can  each  be  either  empty,  or  finite,  or  infinite.  Because  of  the  possibility 
that  no  message  will  be  received  on  an  input  port  in  a  current  state  of  the 
system, an input port has to be probed before each input communication on
the  port  in  order  to  avoid  deadlock.  The  solution  is: 

MERGE = process(X?int(8), Y?int(8), Z!int(8))
  u : int(8)
  *[[ X̅ → X?u; Z!u
    | Y̅ → Y?u; Z!u
   ]]
end

A  number  of  remarks  are  in  order.  First,  observe  that  the  process  has  the 
typical “reactive process” structure mentioned in Subsection 2.4.3. Second, in
the absence of any other specification, we have to assume that both probes may be
true at the same time if there are pending communications on both input ports
at the same time. We therefore had to use the nondeterministic version of the
selection statement. Third, the above solution requires an internal variable
u of the same type as the messages to buffer the last message received and
not yet sent. But such buffering is expensive and, in this case at least,
unnecessary. We can directly output on Z the message being received on X
or Y. Instead of X?u; Z!u, we can write Z!(X?). And similarly for the other
guarded  command. 


2.8.2  Buffers 

Next, we construct a one-place buffer. The process inputs 8-bit integer
messages on the input port L, and outputs them in the same order on the output
port R.

BUF1 = process(L?int(8), R!int(8))
  x : int(8)
  *[L?x; R!x]
end

(Like  the  previous  example,  the  above  process  can  be  implemented  without 
introducing  the  internal  variable  x.) 

A  buffer  of  size  n  can  be  constructed  as  the  linear  composition  of  n  one- 
place  buffers. 

1  BUF(n) = process(L?int(8), R!int(8))
2    p(i : 0..n-1) : BUF1
3    (|| i : 0..n-1 : p(i))
4    chan(i : 0..n-2 : (p(i).R, p(i+1).L))
5    p(0).L = BUF(n).L
6    p(n-1).R = BUF(n).R
7  end


Let  us  briefly  explain  the  different  commands  of  this  declaration.  Line  1  is 
the  usual  heading  which  contains  the  declaration  of  the  external  ports  of  the 
process,  here  L  and  R  both  of  the  type  “integer  of  size  8.”  Line  2  is  the 
declaration of n internal processes p(0) through p(n-1) of the type one-place
buffer (BUF1). Line 3 is the body of the process which consists of the parallel
composition  of  the  n  one-place  buffers  previously  declared. 

Line  4  describes  how  the  internal  ports  are  connected  to  form  internal 
channels.  Line  5  and  line  6  are  the  identification  of  the  external  ports  with 
two  ports  of  the  internal  processes. 
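The behavior of the composed buffer can be sketched sequentially. The class below is a hypothetical model, not a translation of the program: each slot stands for the variable x of one BUF1 process, `None` marks an empty slot (so messages are assumed never to be `None`), and the internal communications between neighbors are modeled by shifting values to the right whenever the next slot is empty.

```python
class BufferChain:
    """Hypothetical sequential model of BUF(n) as n one-place buffers in a row."""

    def __init__(self, n):
        self.slots = [None] * n          # slots[i] models variable x of p(i)

    def _settle(self):
        # Model the internal communications p(i).R / p(i+1).L:
        # a full slot hands its value to an empty right-hand neighbor.
        moved = True
        while moved:
            moved = False
            for i in range(len(self.slots) - 2, -1, -1):
                if self.slots[i] is not None and self.slots[i + 1] is None:
                    self.slots[i + 1] = self.slots[i]
                    self.slots[i] = None
                    moved = True

    def offer(self, value):
        """Environment action on L; returns False when the action would stay pending."""
        if self.slots[0] is not None:
            return False
        self.slots[0] = value
        self._settle()
        return True

    def take(self):
        """Environment action on R; returns None when no message is available."""
        value = self.slots[-1]
        if value is None:
            return None
        self.slots[-1] = None
        self._settle()
        return value
```

For n = 3, three messages are accepted before the input port stops responding, and the messages leave in the order in which they entered.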


2.8.3  A  Lazy  Stack 

We implement a stack S of size n, n > 0, as a string of n communicating
processes defined as follows:

    S = h,          if n = 1,
    S = (h || T),   if n > 1,

where h, the head of the stack, is a process, and T, the tail of the stack,
is a stack of size n - 1. Process h communicates with the environment of
the stack by the communication actions in?x and out!x, and with T by the
communication actions put!x and get?x. Hence, h.put matches T.in, and h.get
matches T.out. (We assume that no attempt is ever made to add a portion
to a full stack, or to remove a portion from an empty stack.)

Each  stack  element  is  either  empty  and  behaves  as  procedure  E,  or  is 
full and behaves as procedure F. The epithet “lazy” is attributed to this
stack because no reshuffling of portions takes place after a portion has been
removed  from  a  full  stack  element.  Hence,  the  full  portions  in  the  stack  are 
not  necessarily  contiguous. 

E = procedure
      [ i̅n̅ → in?x; F
      [] o̅u̅t̅ → get?x; out!x; E
      ]
    end

F = procedure
      [ o̅u̅t̅ → out!x; E
      [] i̅n̅ → put!x; in?x; F
      ]
    end .

If  we  assume  that  a  stack  element  is  initially  empty,  such  an  element  is  de¬ 
scribed  by  the  following  process: 

stack-element = process(in?int(8), out!int(8), get?int(8), put!int(8))
  x : int(8)
  E
end

The  following  alternative  coding  of  the  body  of  the  stack  element  process, 
due  to  Peter  Hofstee,  illustrates  the  advantages  of  the  probe  construct: 

E = *[[ i̅n̅ → in?x
      [] o̅u̅t̅ → get?x
      ];
      [ o̅u̅t̅ → out!x
      [] i̅n̅ → put!x
      ]].
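The lazy behavior can be made concrete with a sequential sketch that follows procedures E and F step by step. The class `StackElement` and the helpers `make_stack` and `occupancy` are hypothetical names; each communication is modeled as a method call, and, as in the text, the sketch assumes that a full stack is never pushed and an empty stack is never popped.

```python
class StackElement:
    """Hypothetical model of one stack element; `tail` is the next element or None."""

    def __init__(self, tail=None):
        self.tail = tail
        self.full = False        # False: behaves as E; True: behaves as F
        self.x = None

    def push(self, value):
        """The environment performs in."""
        if self.full:
            self.tail.push(self.x)       # put!x
        self.x = value                   # in?x
        self.full = True                 # continue as F

    def pop(self):
        """The environment performs out."""
        if not self.full:
            self.x = self.tail.pop()     # get?x
        self.full = False                # continue as E
        return self.x                    # out!x


def make_stack(n):
    """Builds a string of n elements and returns the head."""
    head = None
    for _ in range(n):
        head = StackElement(head)
    return head


def occupancy(head):
    """Lists, from head to bottom, which elements are currently full."""
    flags = []
    while head is not None:
        flags.append(head.full)
        head = head.tail
    return flags
```

After pushing 1, 2, 3, popping twice, and pushing 5 on a stack of size 3, the elements are full, empty, full: the full portions are not contiguous, yet values still come out in last-in, first-out order.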

2.8.4  Palindrome  Recognizer 

A palindrome is a finite sequence of characters (word, sentence) that reads the
same backward and forward. Discounting the difference between uppercase
and lowercase, the sentence “Able was I ere I saw Elba” is a palindrome. In
other words, the sequence S(i : 0..n-1) is a palindrome if and only if:

(∀ i : 0..⌊n/2⌋ - 1 : S(i) = S(n - 1 - i))

We  want  to  design  a  process  that  determines  which  prefixes  of  a  given 
sequence  of  at  most  m  characters  are  palindromes. 

More precisely, the environment behaves as the process (body)

*[put!x; get?b]

where x is a character and b is the boolean whose value is equal to the predicate
“the sequence of characters transmitted on port put so far is a palindrome.”

The palindrome recognizer pal communicates with the environment through
the input port in and the output port out: for each character received on in,
a boolean is output on out whose value is “the sequence of characters
received on port in so far is a palindrome.”

The process pal is a linear array of M elementary processes p(i : 0..M-1)
of type cell, with M = ⌈m/2⌉.

cell = process(in?char, out!boolean, put!char, get?boolean)
  var x, y : char, z : boolean
  in?x; out!true; z := true;
  *[in?y; out!((x = y) ∧ z); put!y; get?z]
end

pal = process(in?char, out!boolean)
  p(i : 0..M-1) : cell
  (|| i : 0..M-1 : p(i))
  chan(i : 0..M-2 : (p(i).put, p(i+1).in))
  chan(i : 0..M-2 : (p(i).get, p(i+1).out))
  p(0).in = pal.in
  p(0).out = pal.out
end

The structure of the cell process can be simplified for the “bottom” process
p(M-1). The cell process can also be improved by introducing concurrency
between communications.
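The way the cells cooperate can be checked with a sequential sketch in which each pair of communications put/get is modeled as one method call. The names `Cell`, `step`, and `palindromic_prefixes` are hypothetical. The sketch takes M = ⌈m/2⌉ cells and makes one assumption about the simplified bottom cell, which the text does not spell out: it performs no put/get and keeps z true.

```python
class Cell:
    """Hypothetical sequential model of one cell; `nxt` is the next cell or None."""

    def __init__(self, nxt=None):
        self.nxt = nxt
        self.x = None
        self.z = True

    def step(self, char):
        """Models one in/out exchange; returns the boolean sent on out."""
        if self.x is None:
            self.x = char                    # in?x; out!true; z := true
            return True
        answer = (self.x == char) and self.z     # out!((x = y) ∧ z)
        if self.nxt is not None:
            self.z = self.nxt.step(char)     # put!y; get?z
        return answer


def palindromic_prefixes(text):
    """For each prefix of `text`, tells whether that prefix is a palindrome."""
    cell_count = (len(text) + 1) // 2        # M = ceil(m / 2)
    head = None
    for _ in range(cell_count):
        head = Cell(head)
    return [head.step(char) for char in text]
```

Cell i remembers the (i+1)-th character and compares it with each later one, while z carries the answer of the remaining cells for the inner part of the sequence, delayed by one character.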


2.8.5 Distributed Mutual Exclusion on a Ring of Processes

An  arbitrary  number  (>  1)  of  cyclic  automata,  called  “masters,”  make  inde¬ 
pendent  requests  for  exclusive  access  to  a  shared  resource.  The  circuit  should 
handle  the  requests  from  the  masters  in  such  a  way  that 

1.  Any  request  is  eventually  granted,  and 

2.  there  is  at  most  one  master  using  the  shared  resource  at  any  time. 

The  masters  are  independent  of  each  other:  They  do  not  communicate 
with  each  other,  and  the  activity  of  a  master  not  using  the  resource  should 
not  influence  the  activity  of  other  masters. 

A master, M, communicates with its private server, m. When M wants
to use the shared resource (M is said to be a candidate), it issues a request to
m. When the request is accepted, M uses that resource (for a finite period of
time), and then informs m that the resource is free again.

The  servers  are  connected  in  a  ring.  At  any  time,  exactly  one  (arbitrary) 
server  holds  a  “privilege.”  Only  the  “privileged  server”  may  grant  the  re¬ 
source  to  its  master  and  thereby  guarantee  mutual  exclusion  on  the  access 
to  the  resource.  A  non-privileged  server  transmits  a  request  from  its  master 
(or  from  its  left-hand  neighbor)  to  its  right-hand  neighbor.  A  request  cir¬ 
culates  to  the  right  (clockwise)  until  it  reaches  a  server  whose  master  is  a 
candidate  (this  server  ignores  the  request  until  it  has  served  its  master)  or 
reaches  the  privileged  server.  The  privileged  server  reflects  the  privilege  to 
the  left  (counter-clockwise)  until  it  reaches  the  server  that  generated  the  re¬ 
quest.  This  server  then  becomes  privileged,  and  may  grant  the  resource  to  its 
master.  The  strategy  of  passing  requests  clockwise  and  reflecting  the  privilege 
counterclockwise  has  two  important  advantages:  First,  no  boolean  message 
need  actually  be  transmitted;  second,  no  message  need  be  reflected,  as  the 
completion  of  a  pending  request  is  interpreted  as  passing  the  privilege. 

master = *[... D; CS; D]
server = *[[ U̅ → [b → skip [] ¬b → R]; U; U; b↑
           | L̅ → [b → skip [] ¬b → R]; L; b↓
          ]].

The  boolean  b  is  used  to  encode  the  presence  of  the  privilege.  The  non- 
deterministic  bar  indicates  that  both  guards  may  be  true  at  the  same  time, 
and  therefore  arbitration  has  to  take  place.  We  can  describe  a  system  in  which 
n servers are connected in a ring by first defining a process pair consisting of
a  master  and  a  server,  and  then  connecting  n  pairs  in  a  ring: 


pair = process(L, R)
  m : server
  M : master
  (M || m)
  chan(m.U, M.D)
end

ring = process
  p(i : 0..n-1) : pair
  (|| i : 0..n-1 : p(i))
  chan(i : 0..n-1 : (p(i).R, p((i+1) mod n).L))
end


(For a complete description and proof of correctness, see [7].)

2.8.6 An Asynchronous Microprocessor

We  will  describe  the  design  of  an  asynchronous  microprocessor  in  Chapter  9. 
In  this  section,  we  briefly  explain  how  the  concurrent  program  for  the  pro¬ 
cessor  was  derived  from  a  sequential  version  by  semantics-preserving  trans¬ 
formations.  We  do  not  show  the  complete  derivation  but  only  the  first  few 
steps. 

The  processor  is  first  described  as  a  sequential  program,  which  is  then 
transformed  into  a  set  of  concurrent  processes  so  as  to  increase  the  concurrency 
in  the  execution  of  a  sequence  of  instructions  by  pipelining.  The  sequential 
program  is  a  non-terminating  loop,  each  step  of  which  is  a  FETCH  phase 
followed  by  an  EXECUTE  phase. 


*[FETCH :   i, pc := imem[pc], pc + 1;
            [ off(i) → offset, pc := imem[pc], pc + 1
            [] ¬off(i) → skip
            ];
  EXECUTE : [ alu(i) → (reg[i.z], f) := aluf(reg[i.x], reg[i.y], i.op, f)
            [] ld(i) → reg[i.z] := dmem[reg[i.x] + reg[i.y]]
            [] st(i) → dmem[reg[i.x] + reg[i.y]] := reg[i.z]
            [] ldx(i) → reg[i.z] := dmem[offset + reg[i.y]]
            [] stx(i) → dmem[offset + reg[i.y]] := reg[i.z]
            [] lda(i) → reg[i.z] := offset + reg[i.y]
            [] stpc(i) → reg[i.z] := pc
            [] jmp(i) → pc := reg[i.y]
            [] brch(i) → [ cond(f, i.cc) → pc := pc + offset
                         [] ¬cond(f, i.cc) → skip
                         ]
            ]
 ].


The variables of the program are the following: As we already mentioned,
variable i contains the instruction currently being executed. All instructions
contain an op field describing the opcode. The parameter fields depend on
the types of the instructions. The most common ones, those for ALU, load,
and store instructions, consist of the three parameters x, y, and z. Variable
cc contains the condition code field of the branch instruction, and f contains
the flags generated by the execution of an alu instruction.

The two memories are described as the arrays imem and dmem. The
index to imem is the program counter variable pc. Variable offset contains
the offset field that extends certain instructions to the following word. The
general-purpose registers are described as the array reg[0..15]. Register
reg[0] is special: It always contains the value zero.

The function evaluation (z, f) := aluf(x, y, op, f) evaluates an alu
instruction with the opcode, op; parameters x and y; and the current value of the
flags, f. The result is an integer, z, and a new value of the flags, f. The
function, aluf, is not described in the program. The boolean functions used
in guards all determine certain properties of the current instruction i and are
assumed to be self-explanatory.
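As an illustration of the FETCH phase only, the two multiple assignments can be sketched as follows. The function `fetch`, the tuple encoding of instructions, and the predicate `has_offset` (standing for off(i)) are hypothetical choices made for this sketch; the actual instruction format is not described in this section.

```python
def fetch(imem, pc, has_offset):
    """One FETCH phase: returns the instruction, its offset word (or None), and the new pc."""
    # i, pc := imem[pc], pc + 1   (both right-hand sides use the old pc)
    i, pc = imem[pc], pc + 1
    offset = None
    if has_offset(i):
        # offset, pc := imem[pc], pc + 1
        offset, pc = imem[pc], pc + 1
    return i, offset, pc


# Hypothetical encoding: an instruction is a tuple whose first field names its type.
OFFSET_TYPES = {"ldx", "stx", "lda", "brch"}


def has_offset(instruction):
    return instruction[0] in OFFSET_TYPES


program = [("ldx", 1, 2), 40, ("alu", 1, 2, 3)]
```

Python's tuple assignment evaluates the whole right-hand side before assigning, which matches the meaning of the multiple assignment in the program text.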


2.8.7  First  Decomposition  into  Concurrent  Processes 

The  first  step  of  the  decomposition  consists  in  replacing  the  previous  program 
with  the  program: 

*[FETCH; E1!i; E2] || *[E1?i; EXECUTE; E2]

We  leave  it  as  an  exercise  to  the  readers  to  convince  themselves  that  this 
decomposition  does  not  introduce  concurrency.  The  concurrent  program  is 
strictly  equivalent  to  the  sequential  one. 

Concurrent activity between the two processes will be introduced by moving
E2 forward in the code of EXECUTE so that the (n+1)st iteration of
FETCH can start before the nth iteration of EXECUTE is finished. This
refinement and the further decomposition of EXECUTE into several processes
are not discussed here. The resulting program can be found in Chapter 9.

The  rest  of  the  exercise  will  concentrate  on  the  further  decomposition  of 
FETCH. The practical way to exploit concurrency in FETCH is through the
implementation  of  the  multiple  assignments.  We  introduce  a  process  for  the 
instruction  memory  which  communicates  the  next  instruction  at  address  pc 
by  a  communication  action  on  channel  ID.  Observe  that  variable  pc  is  shared 
by  the  two  processes.  We  get  the  following  program: 

IMEM  = *[ID!imem[pc]]

FETCH = *[ (ID?i || y := pc + 1); pc := y;
           [ off(i) → (ID?offset || y := pc + 1); pc := y
           [] ¬off(i) → skip
           ]; E1!i; E2
         ]

EXEC  = *[E1?i; EXECUTE; E2]

Next, we delegate the execution of the assignments y := pc + 1; pc := y to a
separate process as follows:

FETCH = *[ PCI1; ID?i; PCI2;
           [ off(i) → PCI1; ID?offset; PCI2
           [] ¬off(i) → skip
           ]; E1!i; E2
         ]

PCADD = *[PCI1; y := pc + 1; PCI2; pc := y]

(The reader worrying about the cost of these extra communications should
note that the two pairs of communications E1 and E2, and PCI1 and PCI2,
are each implemented as the two halves of the same communication action.)


Chapter  3 


The  Object  Code, 
Production  Rules 

3.1  Introduction 

Carrying  the  discrete  model  of  computation  down  to  the  transistor  level  re¬ 
quires  that  the  MOS  transistor  be  idealized  as  an  on/off  switch.  Unfortu¬ 
nately,  the  simple  semantics  of  the  switch  ignore  too  many  electrical  phenom¬ 
ena  that  play  an  important  role  in  the  functioning  of  the  circuit.  A  crucial 
innovation  of  the  method  is  that  the  transistor  need  not  be  viewed  as  a  dis¬ 
crete  switch;  voltages  can  change  in  a  continuous  way  from  one  stable  level 
to  the  other  one,  provided  that  the  changes  are  monotonic. 

The  notation  for  the  object  code  provides  the  weakest  possible  form  of 
control  structure  and  the  smallest  number  of  program  constructs.  In  fact, 
it  contains  exactly  one  construct,  the  production  rule  (PR),  and  one  control 
structure,  the  production  rule  set. 

We  consider  the  production  rule  notation  to  be  the  canonical  representa¬ 
tion  of  a  digital  circuit.  This  representation  can  be  decomposed  into  several 
equivalent  networks  of  digital  operators,  depending  on  the  set  of  building 
blocks  used,  or  even  depending  on  the  technology  (e.g.,  CMOS  or  GaAs) 
used,  but  the  production-rule  set  represents  the  circuit  independently  of  the 
chosen  physical  implementation. 

3.1.1  Definitions 

Production Rule. A production rule (PR) is a construct of the form G ↦ S,
where S is either a simple assignment or an unordered list “s1, s2, s3, ...”
of simple assignments, and G is a boolean expression called the guard of the
PR.

Example:

x ∧ y ↦ z↑
¬x ↦ u↑, v↓

The  semantics  of  a  PR  are  defined  only  if  the  PR  is  stable: 

Stability. A PR G ↦ S is said to be stable in a given computation if, at
any point of the computation, G either is false or remains invariantly true
until the completion of S.

Stability  is  not  guaranteed  by  the  implementation.  It  has  to  be  enforced 
by  the  compilation  procedure. 

Execution of a PR. An execution of the stable PR G ↦ S is an unbounded
sequence of firings. A firing of G ↦ S with G true amounts to the execution
of S. A firing of G ↦ S with G false amounts to a skip.

If S is a list of several simple assignments, the execution of S is the
concurrent execution of all assignments of the list.

Production  Rule  Set.  A  PR  set  is  the  concurrent  composition  of  all  PRs 
of  the  set. 

For example, a directed wire with input x and output y is represented by,
or, perhaps more precisely, is the implementation of the production rule set

x ↦ y↑
¬x ↦ y↓

The  only  composition  operation  on  two  PR  sets  is  the  set  union. 

Theorem.  The  implementation  of  two  concurrent  processes  is  the  set  union 
of  the  two  PR  sets  implementing  the  processes  and  of  the  PR  sets  implement¬ 
ing  the  channels  between  the  processes,  if  any. 

The  proof  follows  from  the  associativity  of  the  concurrent  composition 
operator.  The  other  operations  on  the  PRs  of  a  set  are  those  allowed  by  the 
following  properties: 

•  Multiple  occurrences  of  the  same  PR  are  equivalent  to  one  as  a  conse¬ 
quence  of  the  idempotence  of  the  concurrent  composition. 

•  The two rules G ↦ S1 and G ↦ S2 are equivalent to the single rule
G ↦ S1, S2.

•  The two rules G1 ↦ S and G2 ↦ S are equivalent to the single rule
G1 ∨ G2 ↦ S.

PRs are complementary when they are of the type G1 ↦ x↑ and G2 ↦ x↓.
We require that complementary PRs be non-interfering.

Non-Interference. Two complementary PRs are non-interfering when ¬G1 ∨
¬G2 holds invariantly.

It  can  be  proven  that,  under  the  stability  of  each  PR  and  non-interference 
among  complementary  PRs,  the  concurrent  execution  of  the  PRs  of  a  set  is 
equivalent  to  the  following  sequential  execution: 

*[select a PR with a true guard; fire the PR]

where  the  selection  is  weakly  fair  (each  PR  is  selected  infinitely  often).  From 
now  on,  we  ignore  the  firings  of  a  PR  with  a  false  guard;  a  firing  will  mean 
a  firing  of  a  PR  with  a  true  guard. 

Hence,  any  valid  execution  of  a  production-rule  set  in  which  non-interference 
and  stability  are  fulfilled  is  equivalent  to  a  non-deterministic  sequential  exe¬ 
cution  of  the  production-rule  set.  This  equivalence  facilitates  the  analysis  of 
production-rule  sets. 

Until  we  return  to  these  issues,  we  shall  assume  that  the  stability  and 
non-interference  requirements  are  fulfilled. 
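The equivalent sequential execution can be sketched in a few lines. This is a sketch under assumptions, not a definitive simulator: a PR is encoded as a hypothetical triple (guard, variable, value), only firings that change the state are considered, the run stops when no such firing remains, and a random choice stands in for the weakly fair selection. The example PR set is two directed wires in series, from x to y and from y to z.

```python
import random

# A PR is (guard, variable, value): value True encodes an up-transition,
# value False a down-transition of the variable.
WIRES = [
    (lambda s: s["x"], "y", True),
    (lambda s: not s["x"], "y", False),
    (lambda s: s["y"], "z", True),
    (lambda s: not s["y"], "z", False),
]


def run(prs, state, rng, max_firings=1000):
    """Repeatedly selects a PR with a true guard and fires it, until nothing changes."""
    state = dict(state)
    for _ in range(max_firings):
        true_guards = [(var, val) for guard, var, val in prs if guard(state)]
        # Non-interference: no variable is driven both up and down.
        assert len({var for var, _ in true_guards}) == len(set(true_guards))
        effective = [(var, val) for var, val in true_guards if state[var] != val]
        if not effective:
            return state
        var, val = rng.choice(effective)     # non-deterministic selection
        state[var] = val
    raise RuntimeError("no stable state reached")
```

Starting from x true with y and z false, the only possible order of firings is y up followed by z up, whatever the random choices are.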


3.2  VLSI  implementation  of  PRs 

Stability  and  non-interference  are  the  two  properties  that  make  the  VLSI 
implementation  of  PRs  (almost)  straightforward.  As  an  example,  we  describe 
a  simple  implementation  of  PRs  in  CMOS  technology. 

3.2.1  The  CMOS  transistors 

A  CMOS  circuit  is  a  network  of  “nodes” — variables — interconnected  by  tran¬ 
sistors.  Certain  nodes  are  also  connected  to  the  input-output  “pads”,  which 
provide  the  interface  with  the  environment — we  will  ignore  the  pads  in  this 
presentation.  Other  nodes  are  directly  connected  to  the  power  node,  pro¬ 
viding  the  constant  high-voltage  value — called  VDD — which  represents  the 
logical  constant  true  or  1.  Yet  other  nodes  are  directly  connected  to  the 
ground  node — called  GND — providing  the  constant  low-voltage  value  which 
represents  the  logical  constant  false  or  0. 

A  node  takes  the  continuous  range  of  voltage  values  between  the  high 
voltage and the low voltage. Above a certain voltage v1 the value is interpreted
as 1. Below another voltage v0, the value is interpreted as 0. Thanks to the
stability property, the precise values of v1 and v0, which vary from node
to node, are irrelevant provided that v0 < v1 and the voltage changes are
monotonic. 

(Strict monotonicity is not necessary, and is actually impossible to achieve
because of noise, but we will not enter into these details here.)

A CMOS transistor is either of n-type or p-type. A transistor relates three
nodes in the following way. Let g, standing for “gate”, and x and y be the
three nodes. When g is false for an n-transistor, and true for a p-transistor,
no current passes through the region between x and y, called the channel¹;
thus  x  and  y  are  left  unchanged.  When  g  is  set  to  true  for  an  n-transistor, 
or  false  for  a  p-transistor,  the  channel  becomes  conducting.  In  this  case,  x 
and  y  either  have  the  same  voltages  and  are  left  unchanged,  or  a  current  is 
established  in  the  channel  until  x  and  y  reach  the  same  voltage.  The  common 
value  reached  by  x  and  y  depends  on  electrical  properties  of  x  and  y  that  are 
determined  by  the  physical  sizes  (capacitances)  of  the  nodes  implementing  x 
and  y  and  by  their  interactions  with  the  rest  of  the  circuit.  (Differences  in  node 
capacitances  may  cause  charges  to  flow  through  the  channel  of  a  transistor 
in  a  way  that  results  in  unintended  values  of  the  nodes.  This  phenomenon, 
called  charge  sharing,  may  make  it  quite  difficult  to  predict  the  final  voltage 
value  reached  by  x  and  y.) 

In  order  to  define  the  net-effect  of  a  PR  independently  of  the  physical 
parameters  of  its  implementation,  we  are  going  to  restrict  the  use  of  transis¬ 
tors.  (In  particular,  the  restriction  will  eliminate  most  occurrences  of  charge 
sharing.) 

We impose the condition that a transistor used in isolation connect only
two variables of the circuit: the gate g and one of the other two nodes, say z.
The third node of the transistor is either the power or the ground. With this
restriction, the behavior of a single n-transistor is

g ↦ z↑ or g ↦ z↓ .

The behavior of a single p-transistor is

¬g ↦ z↑ or ¬g ↦ z↓ .

3.2.2  Threshold  voltages 

The current in the channel of a transistor is a function of the so-called
gate-to-source voltage, Vgs, defined as V(g) - min(V(x), V(y)) for an n-transistor,
and as V(g) - max(V(x), V(y)) for a p-transistor. In first approximation, the
current is assumed to be zero when

Vgs < Vtn

for an n-transistor, and

Vgs > Vtp

for a p-transistor. Vtn and Vtp are called the threshold voltages. (Typically,
Vtn ≈ 1V and Vtp ≈ -1V.)

¹ This notion of channel is unrelated to the one we introduced for communication among
processes.

Because of the existence of threshold voltages, if an n-transistor is used to
implement g ↦ z↑, the final value of z is not a “strong” 1, since the channel
will stop conducting as soon as the voltage of z is within Vtn of the gate
voltage. And symmetrically, a p-transistor used to implement ¬g ↦ z↓ does
not produce a “strong” zero as final value of z. Since the voltage drops caused
by the threshold voltages accumulate as we compose operators, it is important
to produce strong signals in order to be able to compose an arbitrary number
of operators. We shall therefore restrict our use of n-transistors to PRs of the
form

g ↦ z↓          (3.1)

and p-transistors to production rules of the form

¬g ↦ z↑ .       (3.2)

With these restrictions, all implementations produce strong signals.

Threshold  voltages  are  difficult  to  adjust  in  CMOS  technology.  Actually, 
they  tend  to  become  more  variable  as  the  feature  size  decreases.  (They  may 
also  vary  during  the  activity  of  the  circuit  because  of  some  electrical  interac¬ 
tion  with  the  substrate,  called  body  effect.)  For  constant  node  capacitance, 
variations in thresholds account for most of the discrepancies in
propagation delays on a CMOS chip. In particular, these variations exclude the
possibility that the ordering in space of a set of variables along a common wire
be used to infer an ordering in time of a set of transitions of these variables.


3.2.3  Switching  circuits 

Consider the canonical (stable) PR

b ↦ z↓

where b is a boolean expression in terms of a set of variables. These variables
are used as gates of transistors implementing a switching circuit s
corresponding to b: s is a series-parallel switching circuit between the ground node (also
called GND) and z. GND has the constant value false. The other constant
node, the power node VDD, has the constant value true.

The switches are n-transistors whose gates are the variables of b, possibly
negated. Furthermore, we have:

b = “there is a path from ground to z in s”

By construction of s, if b holds and remains stable, z is eventually set to
false. (For this reason, s is called a pull-down circuit.) Hence, s is exactly the
implementation of the production rule b ↦ z↓.

Using a symmetrical argument, we can show that the same series-parallel
circuit as s, but with VDD and z connected, and whose switches are
p-transistors, implements the production rule:

bneg ↦ z↑ ,

where  bneg  is  derived  from  b  by  negating  all  variables.  (This  circuit  is  called 
a  pull-up  circuit.) 
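The correspondence between a series-parallel network and its guard can be checked by brute force on a small case. The encoding below is hypothetical: a network is a nested tuple of "sw" (a single switch named after its gate variable), "series", and "parallel", and the example expression b = a ∧ (c ∨ d) is chosen only for illustration. Switches are idealized as on/off, so threshold effects are ignored.

```python
from itertools import product


def conducts(net, state, kind):
    """True when `net` offers a conducting path.

    kind "n": a switch conducts when its gate is true;
    kind "p": a switch conducts when its gate is false.
    """
    op = net[0]
    if op == "sw":
        gate = state[net[1]]
        return gate if kind == "n" else not gate
    parts = [conducts(sub, state, kind) for sub in net[1:]]
    return all(parts) if op == "series" else any(parts)


# Network s for b = a ∧ (c ∨ d): switch a in series with (c parallel d).
S = ("series", ("sw", "a"), ("parallel", ("sw", "c"), ("sw", "d")))


def check():
    for a, c, d in product([False, True], repeat=3):
        state = {"a": a, "c": c, "d": d}
        b = a and (c or d)
        bneg = (not a) and ((not c) or (not d))   # b with all variables negated
        assert conducts(S, state, "n") == b       # pull-down conducts exactly when b
        assert conducts(S, state, "p") == bneg    # same topology, p-switches: bneg
    return True
```

Series composition corresponds to conjunction and parallel composition to disjunction, which is why replacing every n-switch by a p-switch in the same topology yields bneg.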


3.3  Operators 

The two PRs that set and reset the same variable, like

b1 ↦ z↑
b2 ↦ z↓ ,        (3.3)

are implemented as one operator.

Let s1 be the pull-up circuit corresponding to b1, and let s2 be the
pull-down circuit corresponding to b2. The two circuits are connected through the
common node z. Since non-interference has been enforced, ¬b1 ∨ ¬b2 holds at
any time. This guarantees the absence of a conducting path between power
and ground when the operator is not firing. (A path may exist for a short
time when the operator is firing.)

Definition. The operator implementing the two rules is called
“combinational” if b1 ∨ b2 holds at any time, and “state-holding” otherwise.

By  definition,  if  the  operator  is  combinational,  there  is  always  a  conducting 
path  between  either  VDD  or  GND  and  the  output  z.  Hence,  the  value  of  the 
output  is  always  a  strong  false  value  or  a  strong  true  value,  and  therefore  the 
circuit  corresponding  to  the  composition  of  si  and  s2  is  a  valid  implementation 
of  the  operator. 
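
As a rough illustration of this definition, the following Python sketch classifies a pair of guards by enumerating input values. It assumes that every combination of input values can occur, which is stronger than the text's “at any time” (that condition only concerns reachable states); the function name classify is hypothetical.

```python
from itertools import product

def classify(b1, b2, names):
    """Classify the operator  b1 -> z up,  b2 -> z down.

    Assumption: all valuations of `names` are reachable.
    """
    states = [dict(zip(names, vals))
              for vals in product([False, True], repeat=len(names))]
    if any(b1(s) and b2(s) for s in states):
        return "interfering"          # not b1 or not b2 is violated somewhere
    if all(b1(s) or b2(s) for s in states):
        return "combinational"        # b1 or b2 holds everywhere
    return "state-holding"

# nand: (not a or not b) sets z, (a and b) resets z.
nand = classify(lambda s: not s["a"] or not s["b"],
                lambda s: s["a"] and s["b"], ["a", "b"])

# A pair whose guards leave some states uncovered.
holding = classify(lambda s: s["x"] and s["y"],
                   lambda s: not s["x"] and not s["y"], ["x", "y"])

print(nand, holding)  # combinational state-holding
```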

For example, PRs 3.1 and 3.2 together implement an inverter. The circuit
of Figure 3.2 implements the nand-operator defined by the PRs

a ∧ b ↦ z↓
¬a ∨ ¬b ↦ z↑ .

If 3.3 is a state-holding operator, ¬b1 ∧ ¬b2 may hold in a certain state. In
such a state, node z is isolated; there is no path between z and either VDD or
GND. In MOS technology, an isolated node does not retain its value forever;
eventually the charges leak away through the substrate and also through the
transistors of the pull-up and pull-down circuits. If the PRs of the operator are
fired frequently enough to prevent leakage, the implementation of Figure 3.1
can be used for a state-holding operator. Such an implementation is called
dynamic.

Figure 3.1: CMOS implementation of a combinational operator

Otherwise,  it  is  necessary  to  add  a  storage  element  to  the  output  node 
of  a  state-holding  operator.  Such  an  implementation  is  called  static.  In  the 
sequel,  we  assume  that  only  static  implementations  are  used  for  state-holding 
operators. 

A  standard  CMOS  implementation  of  such  a  storage  element  consists  of 
two  cross-coupled  inverters  (see  Figure  3.3).  This  implementation  inverts  the 
value  of  z. 

The  “weak”  inverter,  marked  with  a  letter  w  on  the  figure,  connects  z 
to  either  VDD  or  GND  through  a  high  resistance,  so  as  to  maintain  z  at  its 
intended  voltage  value  [22]. 

The  implementation  of  a  static  state-holding  operator  is  slightly  more 
costly  than  that  of  a  combinational  operator  because  of  the  need  for  a  storage 
device.  Hence,  given  a  pair  of  PRs  that  are  not  combinational,  we  may  first 
try  to  modify  the  guards — under  the  invariance  of  the  semantics — so  as  to 
make  them  combinational. 

3.3.1  The  Standard  Operators 

All  operators  of  one  or  two  inputs  are  used,  and  are  therefore  viewed  as  the 
standard  operators. 

Figure 3.2: CMOS implementation of a NAND-operator


One-Input  Operators 

The two operators with one input and one output are the wire:

x w y ≡ x ↦ y↑
        ¬x ↦ y↓ ,

and the inverter:

¬x w y ≡ ¬x ↦ y↑
         x ↦ y↓ .


Most operators we use have more inputs than outputs. But, in general,
the components we design have as many outputs as inputs. Hence, we need
to reset the balance by introducing at least one operator, the fork, with more
outputs than inputs. A fork with input x and two outputs y and z is defined as:

x ↦ y↑, z↑
¬x ↦ y↓, z↓ .

The  wire  and  the  fork  are  the  only  two  operators  that  are  not  implemented 
as  a  pull-up/pull-down  circuit — called  a  restoring  circuit — but  as  a  simple 
conducting  interconnection  between  input  and  outputs. 

Figure 3.3: A static implementation of a state-holding operator


The  Wire  as  a  Renaming  Operator 

Because the implementation of a wire is the same as that of a node, the wire
behaves as a renaming operator when composed with another operator: The
composition of an arbitrary operator O with output variable x with the wire
x w y is equivalent to O in which x is renamed y. The composition of operator
O with input variable x with the wire y w x is equivalent to O in which x is
renamed y. (Observe that O can even be a wire.)

Unfortunately, the fork is not a renaming operator since the concurrent
assignments to the different outputs of the fork are not completed
simultaneously. In order to use a fork as a renaming operator, we will later
have to make the timing assumption that such a fork is isochronic.


Combinational  Operators  with  Two  Inputs 

We construct all functions B of two variables x and y such that

B ↦ z↑
¬B ↦ z↓ .




We get for B: x ∧ y, x ∨ y, and x = y. We will not list the functions obtained by
inverting inputs of B. (On the figures, a negated input or output is represented
by a small circle on the corresponding line.) This gives the following set.

The and, with the infix notation (x, y) ∧ z, is defined as:

x ∧ y ↦ z↑
¬x ∨ ¬y ↦ z↓ .

The or, with the infix notation (x, y) ∨ z, is defined as:

x ∨ y ↦ z↑
¬x ∧ ¬y ↦ z↓ .

The equality, with the infix notation (x, y) eq z, is defined as:

x = y ↦ z↑
x ≠ y ↦ z↓ .


State-Holding  Operators  with  Two  Inputs 

Next, we construct all different two-input-one-output operators of the form

b1 ↦ z↑
b2 ↦ z↓

such that ¬b1 ∨ ¬b2 holds at any time, but b1 ≠ ¬b2. We select for b1 either
x ∧ y, or x ∨ y, or x = y. For each choice of b1, we construct b2 as any of the
effective strengthenings of ¬b1.

For b1 = (x ∧ y), we get for b2: ¬x ∧ ¬y, ¬x ∧ y, ¬x, and x ≠ y. The first
three choices of b2 lead to the following state-holding operators:

The C-element

(x, y) C z ≡ x ∧ y ↦ z↑
             ¬x ∧ ¬y ↦ z↓ .

(The C-element was introduced by David Muller, and described in [17].)

The switch

(x, y) sw z ≡ x ∧ y ↦ z↑
              ¬x ∧ y ↦ z↓ .

The asymmetric C-element

(x, y) aC z ≡ x ∧ y ↦ z↑
              ¬x ↦ z↓ .


For b2 = (x ≠ y), we get the operator

x ∧ y ↦ z↑
x ≠ y ↦ z↓ .

But, if the stability condition is fulfilled, this operator is not state-holding.
Because of the stability requirement, the state in which ¬x ∧ ¬y holds (the
“storage state”) can only be reached from states x ∧ ¬y and ¬x ∧ y. In both
states, ¬z holds, and, therefore, ¬z holds in the storage state. Hence, we can
weaken the guard of the second PR as (x ≠ y) ∨ (¬x ∧ ¬y), i.e., ¬x ∨ ¬y.
Hence, the operator is equivalent to the and-operator (x, y) ∧ z.

For b1 = (x ∨ y), no effective strengthening of ¬b1 is possible.

For b1 = (x = y), we get the operator:

x = y ↦ z↑
x ∧ ¬y ↦ z↓ .

But if the stability condition is fulfilled, this operator is not state-holding
for the same reasons that the operator with b1 = (x ∧ y) and b2 = (x ≠ y) is
not.

Flip-Flop 

The canonical form we choose for the flip-flop is:

(x, y) ff z ≡ x ↦ z↑
              ¬y ↦ z↓ ,

which requires the invariance of ¬x ∨ y to satisfy non-interference. Observe
that the flip-flop (x, y) ff z can always be replaced with the C-element (x, y) C z
but not vice versa.
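
The replacement claim can be checked by brute force. The sketch below is an illustration only (the function names are hypothetical): it compares the guards of the flip-flop with those of the C-element, first on the states allowed by the invariant ¬x ∨ y, then on all four input states.

```python
from itertools import product

def flip_flop(x, y):
    """Guards of the flip-flop: (guard of z up, guard of z down)."""
    return x, not y

def c_element(x, y):
    """Guards of the C-element: (guard of z up, guard of z down)."""
    return x and y, (not x) and (not y)

all_states = list(product([False, True], repeat=2))
# States allowed by the flip-flop's non-interference invariant (not x) or y.
allowed = [(x, y) for x, y in all_states if (not x) or y]

same_on_allowed = all(flip_flop(x, y) == c_element(x, y) for x, y in allowed)
same_everywhere = all(flip_flop(x, y) == c_element(x, y) for x, y in all_states)
print(same_on_allowed, same_everywhere)  # True False
```

Wherever the flip-flop may legally be used, the two guard pairs coincide, so the C-element is a valid substitute. In the state x ∧ ¬y, which a C-element tolerates, both flip-flop guards hold, which is why the converse replacement fails.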

3.3.2  Multi-Input  Operators 

We use n-input and, or, C-element, whose definitions are straightforward. We
use a multi-input flip-flop defined as:

(x1 ... xk, y1 ... yl) ff z ≡ (∃ i : xi) ↦ z↑
                              (∃ i : ¬yi) ↦ z↓

where (∀ i : ¬xi) ∨ (∀ i : yi).

We also use the combinational if-operator (sometimes called multiplexer),
defined as:

(x, y, z) if u ≡ (x ∧ y) ∨ (¬x ∧ z) ↦ u↑
                 (x ∧ ¬y) ∨ (¬x ∧ ¬z) ↦ u↓ .

The most general and most often used operator is the generalized
C-element, of which all other forms of C-elements are a special case. It
implements a pair of PRs

B1 ↦ x↑
B2 ↦ x↓

in which B1 and B2 are arbitrary conjunctions of elementary terms. (As
usual, the two guards have to be mutually exclusive.) For example:

a ∧ b ∧ ¬c ↦ x↑
¬a ∧ d ↦ x↓

can  be  directly  implemented  with  a  generalized  C-element.  Observe  that  the 
limiting  factor  for  the  size  of  the  guards  is  not  the  number  of  inputs,  but  the 
number  of  terms  in  a  conjunction. 
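
A behavioural model of such an operator can be sketched as follows. The class name GeneralizedC and the guards in the usage example are hypothetical; the model captures only the logical behaviour of a static implementation (the output keeps its value when neither guard holds), not its timing.

```python
class GeneralizedC:
    """State-holding operator for the PR pair  up -> x up,  down -> x down."""

    def __init__(self, up, down, value=False):
        self.up, self.down, self.value = up, down, value

    def update(self, env):
        """Evaluate both guards in `env` and return the resulting output value."""
        up, down = self.up(env), self.down(env)
        if up and down:
            raise ValueError("guards are not mutually exclusive")
        if up:
            self.value = True
        elif down:
            self.value = False
        return self.value          # unchanged when neither guard holds

# Hypothetical guards: (a and b) sets the output, (not a and not c) resets it.
gc = GeneralizedC(lambda e: e["a"] and e["b"],
                  lambda e: not e["a"] and not e["c"])

print(gc.update({"a": True, "b": True, "c": True}))     # True: set
print(gc.update({"a": True, "b": False, "c": False}))   # True: held
print(gc.update({"a": False, "b": False, "c": False}))  # False: reset
```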


3.4  Arbiter  and  Synchronizer 

So  far,  we  have  considered  only  PR  sets  in  which  all  guards  are  stable  and 
non-interfering.  But  we  shall  have  to  implement  sets  of  guarded  commands — 
selections  or  repetitions — in  which  the  guards  are  not  mutually  exclusive,  as 
in  the  probe  selection  example.  Therefore,  we  need  at  least  one  operator  that 
provides  a  non-deterministic  choice  between  two  true  guards. 

3.4.1  Arbiter 

The simplest selection between non-exclusive guards is of the form

*[[ x → ...
  | y → ...
 ]]

where x and y are simple boolean variables, and the two guards are stable. In
order to distinguish among the three basic states of the system (neither
x nor y is selected, x is selected, or y is selected), we need to introduce two
outputs, say, u and v, as follows:

*[[ x → u↑; ...
  | y → v↑; ...
 ]].

Initially, ¬u ∧ ¬v holds as coding of the state “no selection made”. Hence,
when the selection is considered completed, which is just a matter of definition,
u and v should be set back to false. We get

*[[ x → u↑; [¬x]; u↓
  | y → v↑; [¬y]; v↓                                         (3.4)
 ]].

If ¬u ∧ ¬v holds initially, ¬u ∨ ¬v holds at any time.

The above program is a description of the operator known as the “basic
arbiter” or “mutual exclusion element,” denoted as (x, y) arb (u, v). Observe
that the choice between the two guards is not fair.

3.4.2 Synchronizer

When negated probes are used, for instance to implement fairness, we have
to implement selection commands with unstable guards. The synchronizer is
the only operator that accepts non-stable guards. It is defined as

*[[ b ∧ z → u↑; [¬z]; u↓
  | ¬b ∧ z → v↑; [¬z]; v↓                                    (3.5)
 ]].

Variable b may change at any time from false to true. But both b
and z remain true until u or v has changed. Hence, the guard ¬b ∧ z is
unstable, whereas the guard b ∧ z is stable. As in the arbiter case, if ¬u ∧ ¬v
holds initially, ¬u ∨ ¬v holds at any time. (The synchronizer operator was
introduced in [9].)

3.4.3  Implementation  and  Metastability 

Let us first consider the PR sets for 3.4 and 3.5 that contain unstable rules.
The PR set for the “unstable arbiter” is

x ∧ ¬v ↦ u↑
y ∧ ¬u ↦ v↑
¬x ∨ v ↦ u↓
¬y ∨ u ↦ v↓ .

The PR set for the “unstable synchronizer” is

b ∧ z ∧ ¬v ↦ u↑
¬b ∧ z ∧ ¬u ↦ v↑
¬z ∨ v ↦ u↓
¬z ∨ u ↦ v↓ .

The first two PRs of the arbiter are unstable and can fire concurrently. The
same holds for the first two production rules of the synchronizer: since b can
change from false to true at any time, both guards may evaluate to true.

Let us analyze the PR set implementation of the arbiter. The synchronizer
case is very similar. The state x ∧ y ∧ (u = v) of the arbiter is called metastable.
When started in the metastable state, with ¬u ∧ ¬v, the set of PRs specifying
the arbiter may produce the unbounded sequence of firings:

*[(u↑, v↑); (u↓, v↓)]
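
This oscillation can be reproduced with a small Python sketch of the four arbiter rules, where the rule resetting u is taken by symmetry with the rule resetting v. The perfectly symmetric schedule (every enabled rule fires at once) is an assumption chosen to expose the worst case; the function names are hypothetical.

```python
def enabled(s):
    """Effective assignments enabled in state s by the unstable arbiter PR set."""
    x, y, u, v = s["x"], s["y"], s["u"], s["v"]
    rules = [
        (x and not v, "u", True),      # x and not v  -> u up
        (y and not u, "v", True),      # y and not u  -> v up
        (not x or v, "u", False),      # not x or v   -> u down
        (not y or u, "v", False),      # not y or u   -> v down
    ]
    return [(var, val) for guard, var, val in rules if guard and s[var] != val]

def step_all(s):
    """Fire every enabled rule simultaneously (perfectly symmetric schedule)."""
    nxt = dict(s)
    for var, val in enabled(s):
        nxt[var] = val
    return nxt

# Symmetric schedule from the metastable state: u and v oscillate together.
state = {"x": True, "y": True, "u": False, "v": False}
trace = []
for _ in range(4):
    state = step_all(state)
    trace.append((state["u"], state["v"]))
print(trace)  # [(True, True), (False, False), (True, True), (False, False)]

# Asymmetric schedule: firing a single rule first resolves the conflict.
state = {"x": True, "y": True, "u": False, "v": False}
var, val = enabled(state)[0]
state[var] = val
print(state["u"], state["v"], enabled(state))  # True False []
```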

In the implementation, nodes u and v may stabilize to a common intermediate
voltage value for an unbounded period of time. Eventually, the inherent
asymmetry of the physical realization (impurities, fabrication flaws, thermal
noise, etc.) will force the system into one of the two stable states where u ≠ v.
But there is no upper bound on the time the metastable state will last, which
means that it is impossible to include an arbitration device into a clocked
system with absolute certainty that a timing failure cannot occur.

The spurious values of u and v produced during the metastable state must
be eliminated since they are not stable and violate the requirement ¬u ∨ ¬v.
Hence, we compose the “bare” arbiter with a “filter” taking u and v as input
and producing uf and vf as “filtered outputs”. The net effect of the filter is:

uf, vf := (u ∧ ¬v), (v ∧ ¬u)

(In the CMOS construction of the filter shown in Figure 3.4, we use the
threshold voltages to our advantage: The channel of transistor t1 is conducting
only when (u ∧ ¬v) holds, and the channel of transistor t2 is conducting only
when (v ∧ ¬u) holds.)
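
At the boolean level, the net effect of the filter can be tabulated directly. The short sketch below (function name hypothetical) models only this logical effect and says nothing about the analog threshold behaviour of the CMOS circuit.

```python
from itertools import product

def arbiter_filter(u, v):
    """Net effect of the filter: uf, vf := (u and not v), (v and not u)."""
    return u and not v, v and not u

table = {(u, v): arbiter_filter(u, v)
         for u, v in product([False, True], repeat=2)}
for (u, v), (uf, vf) in table.items():
    print(u, v, "->", uf, vf)

# The filtered outputs are never both true, even when u and v are.
print(all(not (uf and vf) for uf, vf in table.values()))  # True
```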


Figure 3.4: An implementation of the basic arbiter


In delay-insensitive design, the correct functioning of a circuit containing
an arbiter or a synchronizer is independent of the duration of the metastable
state; therefore, relatively simple implementations of arbiters and
synchronizers can be used. In synchronous design, however, the implementations
have to meet the additional constraint that the probability of the metastable
state lasting longer than the clock period should be negligible.


Chapter 4

The  Compilation  Method 


4.1  Introduction 

This chapter briefly introduces the main steps of the compilation procedure:
process decomposition, handshaking expansion, and production rule expansion.


4.2  Process  Decomposition 

The first step of the compilation, called process decomposition, consists in
replacing one process with several processes by application of the following

Decomposition rule: A process, P, containing an arbitrary program part,
S, is semantically equivalent to two processes, P1 and P2, where P1 is
derived from P by replacing S with a communication action, C, on the
newly introduced channel (C, D) between P1 and P2, and P2 is the process
*[[D̄ → S; D]].

The structure of P2 will be used so frequently that we introduce an operator
to denote it: the call operator. We denote it by (D/S), and we say that D
calls (or activates) S. (We will later generalize the implementation of the call
operator so that the implementation mentioned above in the definition of the
decomposition rule is just a particular case of the general implementation.)

Observe that process decomposition does not introduce concurrency.
Although P1 and P2 are potentially concurrent, they are never active
concurrently; P2 is activated from P1, much as a procedure or a coroutine would
be. The newly created subprocesses may share variables; but, since the
subprocesses are never active concurrently, there is no conflicting access to the
shared variables. The subprocesses may also share channels; this will require
a special implementation for such channels. Decomposition is applied for each
construct of the language. For construct S, the corresponding process P2 can
be simplified as follows:

•  If S is the selection [B1 → S1 [] B2 → S2], P2 is simplified as

   *[[ D̄ ∧ B1 → S1; D
    [] D̄ ∧ B2 → S2; D                                        (4.1)
    ]].

•  If S is the repetition *[B1 → S1 [] B2 → S2], P2 is simplified as

   *[[ D̄ ∧ B1 → S1
    [] D̄ ∧ B2 → S2
    [] D̄ ∧ ¬B1 ∧ ¬B2 → D                                     (4.2)
    ]].

•  The assignment x := B, where B is an arbitrary boolean expression, is
   implemented as the selection [B → x↑ [] ¬B → x↓], which gives for P2

   *[[ D̄ ∧ B → x↑; D
    [] D̄ ∧ ¬B → x↓; D                                        (4.3)
    ]].

The generalizations to the cases of an arbitrary number of guarded commands
in selection and repetition are obvious. All assignments to the same variable
are also grouped in the same process. Process decomposition is applied
repeatedly until the right-hand side of each guarded command is a straight-line
program.

Process decomposition makes it possible to reduce a process with an
arbitrary control structure to a set of subprocesses of only two different types:
either a (finite or infinite) sequence of communication actions, or a repetition
of type 4.1, 4.2, or 4.3.


4.3  Handshaking  Expansion 

The  next  step  of  the  transformation,  the  handshaking  expansion,  replaces 
each  communication  action  in  a  program  with  its  implementation  in  terms 
of  elementary  actions,  and  each  channel  with  a  pair  of  wire-operators.  We 
shall  first  ignore  the  issue  of  message  transmission  and  implement  only  the 
synchronization  property  of  communication  primitives. 

Channel (X, Y) is implemented by the two wires (xo w yi) and (yo w xi).
If X belongs to process P1 and Y to process P2, then xo and xi belong to
P1, and yo and yi to P2. Initially, xo, xi, yo, and yi, which we will call the
“handshaking variables of (X, Y)”, are false. Assume that the program has
been proven to be deadlock-free and that we can identify a pair of matching
actions X and Y in P1 and P2, respectively. We replace X and Y by the
sequences Ux and Uy, respectively, with:

Ux ≡ xo↑; [xi]                                               (4.4)
Uy ≡ [yi]; yo↑ .                                             (4.5)

Also:

xo ↦ yi↑
¬xo ↦ yi↓
yo ↦ xi↑
¬yo ↦ xi↓

by definition of the wires. By 4.4 and 4.5, any concurrent execution of P1
and P2 contains the sequence of assignments:

xo↑; yi↑; yo↑; xi↑ .

4.3.1  Simultaneous  Completion  of  Non- Atomic  Actions 

We  introduce  a  definition  of  completion  of  a  non-atomic  action  which  makes 
it  possible  to  use  the  notion  of  simultaneous  completion  of  two  non-atomic 
actions. 

By definition, the execution of an atomic action is considered instantaneous,
and thus the simultaneous completion of two atomic actions does not
make sense. (Atomic actions are simple assignments x↑ and x↓, and evaluation
of simple guards, i.e., guards containing one variable. A wait action
of the form [ai] is a non-atomic action that may be treated as the repetition
*[¬ai → skip].)

A  non-atomic  action  is  initiated  when  its  first  atomic  action  is  executed. 
A  non-atomic  action  is  terminated  when  its  last  atomic  action  is  executed. 

For  non-atomic  actions,  the  notion  of  completion  does  not  coincide  with 
that  of  termination.  A  non-atomic  action  might  be  considered  completed  even 
if  it  has  not  terminated,  i.e.,  even  if  some  atomic  actions  that  are  part  of  the 
action  have  not  been  executed.  The  definition  of  suspension  is  derived  from 
that  of  completion. 

Definition. A non-atomic action X is completed when it is initiated and it
is guaranteed to terminate, i.e., when all possible continuations of the
computation contain the complete sequence of atomic actions of X.

The above definition can be further explained as follows: Consider a prefix
t1 of an arbitrary trace of a computation. (A trace is a sequence of atomic
actions corresponding to a possible execution of the program.) The completion
of X is identified with the point in the computation where t1 has been
completed, if 1) X is initiated in t1, and 2) all possible sequences t2, such that
t1 extended with t2 is a valid trace of the computation, contain the remaining
atomic actions of X. Hence, the completions of two non-atomic actions
coincide if their completion points coincide.

(Observe that there may be several points in a trace that can act as
completion point, which makes it easier to align the two completion points of two
overlapping sequences so as to implement the bullet operator.)

Definition.  Between  initiation  and  completion,  an  action  is  suspended. 

These  definitions  of  completion  and  suspension  are  valid  because  they 
satisfy  the  three  semantic  properties  of  completion  and  suspension  that  are 
used  in  correctness  arguments,  namely: 

•  {cX  =  x}  X  {cX  =  x  +  1}  , 

•  qX  =>  pre(X),  where  pre(X)  is  any  precondition  of  X  in  terms  of  the 
program  variables  and  auxiliary  program  variables, 

•  If  X  is  completed,  eventually  X  is  terminated. 

These definitions will be used to implement the bullet operator and the
communication primitives as defined by axioms A1 and A2. Consider the
interleaving of Ux and Uy. At the first semicolon, i.e., after xo↑, Ux has been
initiated, but cannot be considered completed since the valid continuation
that does not contain Uy does not contain the rest of Ux. At the second
semicolon, both Ux and Uy have been initiated, and thus, all continuations
contain the rest of the interleaving of Ux and Uy. Hence, Ux and Uy are
guaranteed to terminate when they are both initiated, i.e., they fulfil A1 and
A2.


4.3.2  Four-phase  Handshaking 

Unfortunately, when the communication implemented by Ux and Uy
terminates, all handshaking variables are true. Hence, we cannot implement the
next communication on channel (X, Y) with Ux and Uy. However, the
complementary implementation can be used for the next matching pair, namely:

Dx ≡ xo↓; [¬xi]
Dy ≡ [¬yi]; yo↓ .

The solution consisting in alternating Ux and Dx as an implementation of X,
and Uy and Dy as an implementation of Y, is called two-phase handshaking,
or two-cycle signaling. Since it is in most cases impossible to determine
syntactically which X- or Y-actions follow each other in an execution, the general
two-phase handshaking implementations require testing the current value of
the variables as follows:

xo := ¬xo; [xi = xo]
[yi ≠ yo]; yo := ¬yo .
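
The general two-phase protocol can be walked through in Python. This is a sequential walk-through of one valid interleaving rather than a concurrent simulation; the wires are modelled as immediate copies, and the function name two_phase is hypothetical.

```python
def two_phase(communications):
    """Walk through two-phase handshakes; return (transition count, final state)."""
    xo = xi = yo = yi = False
    transitions = 0
    for _ in range(communications):
        xo = not xo            # active side:  xo := not xo
        yi = xo                # wire xo w yi
        assert yi != yo        # passive wait [yi != yo] can now proceed
        yo = not yo            # passive side: yo := not yo
        xi = yo                # wire yo w xi
        assert xi == xo        # active wait [xi = xo] can now proceed
        transitions += 4
    return transitions, (xo, xi, yo, yi)

print(two_phase(1))  # (4, (True, True, True, True))
print(two_phase(2))  # (8, (False, False, False, False))
```

Each communication costs four variable transitions, and the variables return to their initial values only after every second communication.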

In general, we prefer to use a simpler solution, known as four-phase
handshaking, or four-cycle signaling. In a four-phase handshaking protocol, X-actions
are implemented as “Ux; Dx” and Y-actions as “Uy; Dy”. Observe that the
D-parts in X and Y introduce an extra communication between the two
processes whose only purpose is to reset all variables to false.

Both protocols have the property that for a matching pair (X, Y) of
actions, the implementation is not symmetrical in X and Y. One action is
called active and the other one passive. The four-phase implementation, with
X active and Y passive, is:

X ≡ xo↑; [xi]; xo↓; [¬xi]                                    (4.6)
Y ≡ [yi]; yo↑; [¬yi]; yo↓ .                                  (4.7)

(We will introduce an alternative form of active implementation, called
lazy active.) Although four-phase handshaking contains twice as many actions
as two-phase handshaking, the actions involved are simpler and are more
amenable to the algebraic manipulations we shall introduce later. When
operator delays dominate the communication costs, which is the case for
communication inside a chip, four-phase handshaking will, in general, lead to
more efficient solutions. When transmission delays dominate the communication
costs, which is the case for communication between chips, two-phase is
preferred.
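
The four-phase protocol, with X active and Y passive, can be exercised with a minimal interpreter for handshaking expansions. The sketch below is an illustration under assumptions: the action encoding, the round-robin scheduler, and the name run are hypothetical, and the scheduler produces only one of the valid interleavings.

```python
def run(processes, wires, state, max_rounds=100):
    """Interleave sequential processes and wires; return the trace of transitions.

    processes: lists of actions ("set", var, value) or ("wait", var, value)
    wires:     (source, target) pairs; the target follows the source
    """
    pcs = [0] * len(processes)
    trace = []
    for _ in range(max_rounds):
        progressed = False
        for k, proc in enumerate(processes):
            if pcs[k] == len(proc):
                continue                      # this process has terminated
            kind, var, val = proc[pcs[k]]
            if kind == "wait" and state[var] != val:
                continue                      # suspended on this wait
            if kind == "set":
                state[var] = val
                trace.append(var + ("+" if val else "-"))
            pcs[k] += 1
            progressed = True
        for src, dst in wires:
            if state[dst] != state[src]:
                state[dst] = state[src]
                trace.append(dst + ("+" if state[dst] else "-"))
                progressed = True
        if not progressed:
            break
    return trace

active = [("set", "xo", True), ("wait", "xi", True),
          ("set", "xo", False), ("wait", "xi", False)]
passive = [("wait", "yi", True), ("set", "yo", True),
           ("wait", "yi", False), ("set", "yo", False)]
wires = [("xo", "yi"), ("yo", "xi")]
state = dict(xo=False, xi=False, yo=False, yi=False)

trace = run([active, passive], wires, state)
print(trace)  # ['xo+', 'yi+', 'yo+', 'xi+', 'xo-', 'yi-', 'yo-', 'xi-']
print(state)  # all four handshaking variables are false again
```

The first four transitions are the up-going phase and the last four are the reset phase, after which the same expansion can be reused for the next communication.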

4.3.3  Probe 

A simple implementation of the probe X̄ is xi, with X implemented as passive.
(Given our definition of suspension, the proof that this implementation of the
probe fulfills its definition is straightforward.)

A probed communication action X̄ → ... X is then implemented as

xi → ... xo↑; [¬xi]; xo↓ .

4.3.4  Choice  of  Active  or  Passive  Implementation 

When no action of a matching pair is probed, the choice of which action should
be active and which passive is arbitrary, but a choice has to be made. The
choice can be important for the composition of identical circuits. A simple
rule is that, for a given channel (X, Y), all actions on one port (called the
active port) are active, and all actions on the other port (called the passive
port) are passive. If X̄ is used, all X-actions are passive, with the obvious
restriction that Ȳ cannot be used in the same program.

However,  we  shall  see  that  this  criterion  for  choosing  active  and  passive 
ports  may  conflict  with  another  criterion  related  to  the  implementation  of 
input  and  output  commands. 


4.3.5  Reshuffling 

In  4.6  and  4.7,  Dx  and  Dy  are  used  only  to  reset  all  variables  to  false.  Hence, 
provided  that  the  cyclic  order  of  the  actions  of  4.6  and  4.7  is  maintained,  the 
sequences  Dx  and  Dy  can  be  inserted  at  any  place  in  the  program  of  each  of  the 
processes  without  invalidating  the  semantics  of  the  communication  involved. 
This  transformation,  called  reshuffling ,  may  introduce  a  deadlock. 

Reshuffling,  which  is  the  source  of  significant  optimizations,  will  be  used 
extensively.  It  is  therefore  important  to  know  when  it  can  be  applied  without 
introducing  deadlock. 

There are two simple cases where the reshuffling of sequence “Ux; Dx; S”
into sequence “Ux; S; Dx” does not introduce deadlock:

•  S  contains  no  communication  action,  or 

•  X  is  an  internal  channel  introduced  by  process  decomposition. 

4.3.6  Lazy-active  protocol 

Consider the active implementation of communication command X:

xo↑; [xi]; xo↓; [¬xi] .

We introduce an alternative protocol, called lazy active:

[¬xi]; xo↑; [xi]; xo↓ .

The lazy active protocol is derived from the active one by postponing wait
action [¬xi] until the next communication on X, and by adding a vacuous
wait action [¬xi] at the beginning of the first communication X. Hence, the
lazy active protocol is a correct implementation when combined with a passive
protocol.

The lazy active protocol is not identical to a passive protocol in which
the input variable is replaced with its negation. In a passive protocol, the
effective part (the upgoing part) of the protocol is [xi]; xo↑. In a lazy active
one, the effective part is xo↑; [xi].


4.4 Production-rule Expansion

Production-rule expansion is the transformation from a handshaking
expansion to a set of PRs. It is the most crucial and most difficult step of the
compilation since it requires enforcing sequencing by semantic means. It
consists of three steps: state assignment, guard strengthening, symmetrization.
Consider the handshaking expansion

S ≡ *[[w0]; t0; [w1]; t1; ...; [wn-1]; tn-1] .

The wait-conditions are boolean expressions, possibly identical to true, and
the ti are simple assignments. The extension to the case of multiple
assignments between the wait-conditions is straightforward.

The production rule expansion of S is the transformation of S into a
semantically equivalent set of production rules. Let

P ≡ {bi ↦ ti | 0 ≤ i < n}

be such a set.


4.4.1  Notations  and  Definitions 

For an arbitrary PR p, p.g and p.a denote the guard and the assignment of
p, respectively. The predicate R(a), the result of the simple assignment a,
is defined as: R(x↑) = x, and R(x↓) = ¬x. An execution of a PR that
changes the value of the assigned variable is called effective, otherwise it is
called vacuous.
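
These notions translate directly into code. The following sketch is illustrative only: the encoding of a PR as a (guard, assignment) pair and the names result and fire are assumptions made for the example.

```python
def result(assignment):
    """R(a): predicate that holds once the assignment a = (var, value) is executed."""
    var, value = assignment
    return lambda state: state[var] == value

def fire(pr, state):
    """Fire PR (guard, assignment) in `state`; return (new_state, kind)."""
    guard, assignment = pr
    if not guard(state):
        return state, "disabled"
    var, value = assignment
    kind = "vacuous" if result(assignment)(state) else "effective"
    return {**state, var: value}, kind

# The PR  x -> y up, encoded as a guard function and an assignment.
pr = (lambda s: s["x"], ("y", True))

s1, kind1 = fire(pr, {"x": True, "y": False})
s2, kind2 = fire(pr, s1)
print(kind1, kind2)  # effective vacuous
```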

With  these  definitions,  the  stability  of  a  PR  can  be  reformulated  as  follows: 

Stability. A PR p is stable in a computation if and only if p.g can be
falsified only in states where R(p.a) holds. As a consequence, p.g holds as a
postcondition of any effective firing of p.

The production-rule expansion algorithm compiles a handshaking expansion
S into a set P of PRs, all of which are stable, with the exception of those
whose guards contain negated probes. Since, as we shall see, the guards of
the PRs are obtained by strengthening the wait-conditions of S, the stability
of the wait-conditions is necessary to satisfy the stability of the PRs.

A  wait-condition  w  is  stable  if  once  w  is  true  ,  it  remains  true  at  least 
until  the  completion  of  the  following  assignment.  Unstable  wait-conditions 
can  be  caused  by  negated  probes  or  unrestricted  shared  variables.  The  case 
of  negated  probes  will  be  dealt  with  separately  by  introducing  synchronizers. 
We  ignore  the  use  of  shared  variables  in  these  lecture  notes. 

In particular, the wait-conditions of the handshaking expansions are stable,
also after reshuffling.


4.4.2 Sequencing

The set P of PRs implements S when the following conditions are fulfilled:

Guard strengthening. The guards of the PRs of P are obtained by strengthening
the wait conditions of S: ∀ i :: bi ⇒ wi and, in the initial state, w0 ⇒ b0.

Sequential execution. (N i :: bi ∧ ¬R(ti)) ≤ 1, i.e., at most one effective PR can
be executed at a time.

Program-order execution. For all i: If wi+1 holds eventually as a postcondition
of ti in S, then bi+1 holds eventually as a postcondition of ti in P. (Addition
i + 1 is modulo n.)

The first condition establishes that an execution of PR bi ↦ ti in P is
equivalent to an execution of [wi]; ti in S. The second and third conditions
establish that the order of execution of effective PRs of P is the order
specified by S, which we have called the program-order, and that no deadlock is
introduced in the construction of P.
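
The first two conditions can be checked mechanically for a candidate PR set. The sketch below is an illustration under assumptions: PRs are encoded as (guard, (variable, value)) pairs, guard strengthening is tested over all valuations rather than over reachable states only, and the function names are hypothetical.

```python
from itertools import product

def effective_count(prs, state):
    """(N i :: b_i and not R(t_i)): number of PRs whose firing would be effective."""
    return sum(1 for guard, (var, val) in prs
               if guard(state) and state[var] != val)

def strengthens(b, w, names):
    """Check b => w over all valuations of `names`."""
    return all((not b(s)) or w(s)
               for vals in product([False, True], repeat=len(names))
               for s in [dict(zip(names, vals))])

# PR set for the passive handshake [yi]; yo up; [not yi]; yo down.
prs = [(lambda s: s["yi"], ("yo", True)),
       (lambda s: not s["yi"], ("yo", False))]

states = [dict(yi=a, yo=b) for a, b in product([False, True], repeat=2)]
print(max(effective_count(prs, s) for s in states))   # 1

# A guard yi and a strengthens the wait-condition yi, but not the converse.
print(strengthens(lambda s: s["yi"] and s["a"], lambda s: s["yi"], ["yi", "a"]))  # True
print(strengthens(lambda s: s["yi"], lambda s: s["yi"] and s["a"], ["yi", "a"]))  # False
```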

As we shall see, it is not always possible to construct, for a given
handshaking expansion, a PR set that satisfies the above three conditions. In certain
cases, the handshaking expansion must be augmented with assignments to
new variables, called state variables. This transformation, which is always
possible, will be explained later.

4.4.3  Acknowledgement 

Fulfilling the second and third conditions requires that for any two PRs
$p : b \mapsto t$ and $p' : b' \mapsto t'$, such that p immediately precedes p'
in the program order,

$$b' \Rightarrow R(t)$$

holds as a postcondition of p. We say that b' is the acknowledgement of t.
Hence the

Acknowledgement Property. For a PR set executed in program order, the guard of
each PR is an acknowledgement of the immediately preceding assignment.

We shall see that the acknowledgement property is necessary but not sufficient
to ensure program-order execution.

We use two kinds of acknowledgements depending on the type of variable used in
the assignment, but other forms of acknowledgement can be envisioned. If t
assigns an internal variable, then the acknowledgement is implemented by
strengthening b' as $b' \wedge R(t)$. For example, if t is $x\uparrow$, the
acknowledgement is $b' \wedge x$.
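The strengthening step for internal variables can be illustrated with a small
Python sketch. The representation is an assumption made here for illustration
only: a conjunctive guard is a dictionary from variable names to required
values, and the helper names `result` and `acknowledge` are hypothetical.

```python
# Hypothetical representation: a guard is a dict {variable: required value},
# read as a conjunction of literals; an assignment is a (variable, value) pair.

def result(assignment):
    """R(t): the literal that holds once assignment t has completed."""
    var, value = assignment
    return {var: value}

def acknowledge(guard, assignment):
    """Strengthen guard b' into b' AND R(t) for an internal variable."""
    var, value = assignment
    if guard.get(var, value) != value:
        raise ValueError("guard contradicts the result of the assignment")
    return {**guard, **result(assignment)}

# t is x-up, so R(t) is x; the guard b' = (not ri) becomes (not ri) and x.
b_prime = {"ri": False}
strengthened = acknowledge(b_prime, ("x", True))
print(strengthened)
```

With this encoding the strengthened guard simply gains one literal, which is
why acknowledging an internal variable never needs a disjunction.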

If t assigns a handshake variable, i.e., a variable implementing a
communication command, another kind of acknowledgement can be used as follows.

Acknowledgement of Output Variables. For xo and xi used in an active protocol,
xi is an acknowledgement of $xo\uparrow$, and $\neg xi$ is an acknowledgement
of $xo\downarrow$. For xo and xi used in a lazy-active protocol, xi is an
acknowledgement of $xo\uparrow$. For yo and yi used in the passive protocol of
4.7, $\neg yi$ is an acknowledgement of $yo\uparrow$, and yi is an
acknowledgement of $yo\downarrow$.


4.4.4  Implementation  of  Stability 

Consider a PR set P, which implements a given program S. We are going to show
that the acknowledgement property, which is necessary to construct a P that
implements S, is also sufficient to guarantee stability.

The execution of a PR p of P establishes a path between a constant node (either
VDD or GND), and the node implementing the variable, say x, assigned by p.
Either p.g holds forever after p; or the firing of another PR l, the
invalidating PR of p, will establish $\neg p.g$, hence cutting the path from the
constant node to x.

Let $\bar{p}$ be the complementary PR of p, i.e., the PR with the complementary
assignment. If the PR set contains both p and $\bar{p}$, then it also contains l
because of the non-interference requirement between complementary PRs. And we
have the order of execution:

$$p \prec l \prec \bar{p} .$$

In all the states between l and $\bar{p}$, the original path to x is cut. In
that case, we have to see to it that the assignment to x is completed before the
path is cut. Hence the

Completion requirement. Assignment p.a is completed when a PR q is completed
whose guard is an acknowledgement of p.a. The execution order of the PR set must
satisfy

$$p \prec q \prec l .$$

Since this requirement is already implied by the acknowledgement property, the
construction of P automatically guarantees stability.

We can add an extra requirement to eliminate the pathological cases of
“disguised” self-invalidating PRs, even though such cases rarely arise in
practice, and they can be dealt with at the implementation level.


4.4.5  Self-Invalidating  PRs 

Definition. A PR p is self-invalidating when $R(p.a) \Rightarrow \neg p.g$.

For example, $\neg x \mapsto x\uparrow$ is self-invalidating.

Self-invalidating PRs are disallowed since they violate the stability
requirement. Fortunately, they are excluded by the completion requirement since
it implies $l \neq p$.

For instance, the circuit consisting of an inverter with its output connected
to its input is excluded by the completion requirement since it corresponds to
the PR set:

$$\begin{aligned}
\neg x &\mapsto x\uparrow \\
x &\mapsto x\downarrow
\end{aligned}$$

and the two PRs of the set are self-invalidating. However, the PR set

$$\begin{aligned}
\neg x &\mapsto y\uparrow \\
x &\mapsto y\downarrow \\
y &\mapsto x\uparrow \\
\neg y &\mapsto x\downarrow
\end{aligned}$$

fulfils the completion requirement, although it is the same circuit as
previously, since the only change is the addition of the wire
$y\ \mathbf{w}\ x$.

We eliminate such “disguised” self-invalidating PRs by adding the

Restoring Acknowledgement Requirement. There is at least one restoring PR r
satisfying $p \prec r \preceq l$, where r is restoring if it is not part of a
wire or a fork.

With this extra requirement, all forms of self-invalidating PRs are eliminated.
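The definition of a self-invalidating PR can be tested mechanically when guards
are conjunctions of literals. The following Python sketch assumes that
restriction; the function name `is_self_invalidating` and the dictionary
encoding of guards are illustrative choices, not notation from these notes.

```python
def is_self_invalidating(guard, assignment):
    """True when R(p.a) implies not p.g, for a conjunctive guard.

    guard: dict {variable: required value}; assignment: (variable, value).
    The result of the assignment falsifies a satisfiable conjunction exactly
    when the conjunction requires the assigned variable to hold the
    opposite value.
    """
    var, value = assignment
    return var in guard and guard[var] != value

# Inverter with its output tied to its input: both rules are rejected.
inverter_loop = [({"x": False}, ("x", True)), ({"x": True}, ("x", False))]
# Same circuit with the wire y w x inserted: no single rule is rejected.
disguised = [({"x": False}, ("y", True)), ({"x": True}, ("y", False)),
             ({"y": True}, ("x", True)), ({"y": False}, ("x", False))]

print([is_self_invalidating(g, a) for g, a in inverter_loop])
print([is_self_invalidating(g, a) for g, a in disguised])
```

The second list contains no positive answer, which illustrates why the
rule-by-rule definition alone does not catch the disguised case and the
restoring acknowledgement requirement is added.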

It is remarkable that the acknowledgement requirement, which is necessary to
enforce the sequential execution of a PR set, is also sufficient to satisfy
stability. From now on, we can manipulate PRs as if the transitions were
discrete. However, we have made no simplifying assumption on the physical
behavior of the system. The only physical requirement so far is that of
monotonicity.

Another requirement on the implementation is that the rings of operators that
constitute a circuit keep oscillating. It turns out that eliminating
self-invalidating PRs enforces the condition that a ring contain at least three
restoring operators, which is a necessary (and in practice also sufficient)
condition for the ring to oscillate, thanks to the “gain” property of restoring
gates. (See [15] for an explanation of gain.)


Chapter 5

Production Rule Expansion

5.1  Introduction 

In this chapter, we describe the techniques for production rule expansion in
more detail. We first deal with the simple case of a straightline program.
The general case of a set of guarded commands is introduced in one example.
We also introduce the next step of the compilation, called operator reduction,
which produces a network of cells from production rules.


5.2  Straightline  Programs 

As a first example, let us implement the simple process (L/R), where R is an
active channel. This process is one of the basic building blocks for
implementing sequencing. The handshaking expansion gives:

$$*[[li];\ ro\uparrow;\ [ri];\ ro\downarrow;\ [\neg ri];\ lo\uparrow;\ [\neg li];\ lo\downarrow] \tag{5.1}$$

We now consider the handshaking expansion as the specification of the
implementation: Any implementation of the program has to satisfy the ordering
defined by 5.1. The next step is to construct a production-rule set that
satisfies this ordering. We start with the production-rule set that is
syntactically derived from 5.1:

$$\begin{aligned}
li &\mapsto ro\uparrow \\
ri &\mapsto ro\downarrow \\
\neg ri &\mapsto lo\uparrow \\
\neg li &\mapsto lo\downarrow
\end{aligned}$$

(As a clue to the reader, PRs of a set are listed in program order.)

Since the program is deadlock-free, effective execution of the PRs in program
order is always possible. However, some other execution orders may also be
possible. If execution orders other than the program order are possible for
the production-rule set, the guards of some rules are strengthened so as to
eliminate these execution orders.

In our example, program order is not the only execution order for the
syntactic production-rule set: Since $\neg ri$ holds initially, the third PR can
be executed first. This is also true for the fourth PR; but the execution of the
fourth rule in the initial state is vacuous. Because all handshaking variables
of R are back to false when R is completed, we cannot find a guard for the
transition that holds only as a precondition of $lo\uparrow$ in 5.1. Hence, we
cannot distinguish the state following R from the state preceding R, and thus
the sequential execution condition introduced in Section 4.4.2 cannot be
satisfied.
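The out-of-order firing can be reproduced with a few lines of Python. This is a
minimal sketch under an assumed encoding: each rule of the syntactic set is a
tuple of a label, a guard function, a variable and a target value, and a rule is
called effective when its guard holds and its firing changes the state. The
names `RULES` and `effective` are introduced here for illustration.

```python
# Syntactic PR set derived from (5.1): (name, guard, variable, value).
RULES = [
    ("ro up",   lambda s: s["li"],     "ro", True),
    ("ro down", lambda s: s["ri"],     "ro", False),
    ("lo up",   lambda s: not s["ri"], "lo", True),
    ("lo down", lambda s: not s["li"], "lo", False),
]

def effective(state, rules):
    """Names of the rules whose guard holds and whose firing changes the state."""
    return [name for name, guard, var, value in rules
            if guard(state) and state[var] != value]

initial = {"li": False, "ri": False, "ro": False, "lo": False}
print(effective(initial, RULES))                   # third PR fires first
print(effective({**initial, "li": True}, RULES))   # two effective PRs at once
```

The first call shows the third PR being effective in the initial state; the
second shows two effective PRs once li is raised, which contradicts sequential
execution.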

In  order  to  fulfil  the  sequential  execution  condition,  we  have  to  guarantee 
that  each  state  of  the  handshaking  expansion  is  unique,  i.e.,  there  exists  a 
predicate  in  terms  of  variables  of  the  program  that  holds  only  in  this  state. 
The  task  of  transforming  the  handshaking  expansion  so  as  to  make  each  state 
unique  is  called  state  assignment. 


5.3  State  Assignment  With  State  Variables 

The first technique to define uniquely the state in which the transition
$lo\uparrow$ is to take place consists in introducing a state variable, say x,
initially false. Handshaking expansion 5.1 becomes

$$*[[li];\ ro\uparrow;\ [ri];\ x\uparrow;\ [x];\ ro\downarrow;\ [\neg ri];\ lo\uparrow;\ [\neg li];\ x\downarrow;\ [\neg x];\ lo\downarrow] \tag{5.2}$$

Observe that 5.2 is semantically equivalent to 5.1, since the two sequences of
actions that are added to 5.1, namely, $x\uparrow; [x]$ and
$x\downarrow; [\neg x]$, are equivalent to a skip. (The newly introduced variable
x is used nowhere else.) There are several places where the two assignments to
the state variable can be introduced. We shall not discuss here the different
heuristics that are used in the placement of the variables. But it is important
to observe that minimizing the number of state variables is not a relevant
criterion in the choice of a state assignment. What counts is minimizing the
number of transitions on state variables, and the sizes of the production-rule
guards.
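The effect of the state variable can be checked by walking through one cycle of
each expansion and comparing the states that precede the assignments. The
Python sketch below is only an illustration under simplifying assumptions: a
wait action is modelled as the moment at which the awaited value becomes
present, the cycle starts from the all-false state, and only the states
preceding assignments are compared. The names `preconditions` and `ambiguous`
are hypothetical.

```python
def preconditions(expansion, variables):
    """Walk one cycle of a straight-line expansion from the all-false state.

    expansion: list of (kind, variable, value), kind being "wait" or "set".
    Returns (assignment, state) pairs, where state is the tuple of variable
    values holding just before the assignment.
    """
    state = dict.fromkeys(variables, False)
    pairs = []
    for kind, var, value in expansion:
        if kind == "set":
            pairs.append(((var, value), tuple(state[v] for v in variables)))
        state[var] = value
    return pairs

def ambiguous(expansion, variables):
    """Pairs of distinct assignments that share the same preceding state."""
    seen = {}
    clashes = []
    for assignment, state in preconditions(expansion, variables):
        if state in seen:
            clashes.append((seen[state], assignment))
        else:
            seen[state] = assignment
    return clashes

W, S = "wait", "set"
HSE_5_1 = [(W, "li", True), (S, "ro", True), (W, "ri", True), (S, "ro", False),
           (W, "ri", False), (S, "lo", True), (W, "li", False), (S, "lo", False)]
HSE_5_2 = [(W, "li", True), (S, "ro", True), (W, "ri", True), (S, "x", True),
           (W, "x", True), (S, "ro", False), (W, "ri", False), (S, "lo", True),
           (W, "li", False), (S, "x", False), (W, "x", False), (S, "lo", False)]

print(ambiguous(HSE_5_1, ["li", "ri", "ro", "lo"]))
print(ambiguous(HSE_5_2, ["li", "ri", "ro", "lo", "x"]))
```

For 5.1 the sketch reports that ro up and lo up share a preceding state; for
5.2 it reports no clash.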


5.4  The  Basic  Algorithm  For  PR  expansion 

We consider a straightline handshaking expansion, and assume that state
assignment has been performed. Hence, each state of the handshaking expansion
is unique, and we can therefore generate a PR set that is semantically
equivalent to the handshaking expansion.

For the time being, we assume that each assignment to a variable, such as
$x\uparrow$ or $x\downarrow$, occurs at most once in the program. This
restriction is easily enforced by renaming; in the case of a program

$$p \equiv \ldots;\ x\uparrow;\ \ldots;\ x\downarrow;\ \ldots;\ x\uparrow;\ \ldots;\ x\downarrow;\ \ldots$$

we can rename the variable as

$$p' \equiv \ldots;\ x1\uparrow;\ \ldots;\ x1\downarrow;\ \ldots;\ x2\uparrow;\ \ldots;\ x2\downarrow;\ \ldots$$

We first perform the handshaking expansion of p'. We then observe that since
$\neg x1 \vee \neg x2$ holds at any time, we can combine x1 and x2 by the two
rules:

$$\begin{aligned}
x1 \vee x2 &\mapsto x\uparrow \\
\neg x1 \wedge \neg x2 &\mapsto x\downarrow .
\end{aligned}$$

If we treat the cases of selection and repetition separately, we do not have
disjunctions in wait-actions. Hence, we can construct all production-rule guards
as conjunctions; disjunction will be introduced next in the symmetrization
step.

5.4.1  First  Method:  Weakening  Strong  Guards 

Since  each  state  of  the  handshaking  expansion  is  uniquely  defined,  the  set  of 
production  rules  in  which  each  guard  is  the  strongest  predicate  in  this  state 
is  ordered. 

The  set  of  strongest  guards  is  constructed  mechanically  by  determining 
in  each  state  the  value  of  all  variables  that  are  defined  in  that  state:  the 
strongest  predicate  in  that  state  is  the  conjunction  of  all  terms  that  are  true 
in  that  state. 

We can then simplify the guards of the PRs by using program properties of the
form “$P \Rightarrow R$ holds as a precondition of the PR” to replace
$P \wedge R$ by P. (This method has been proposed and used by Huub Schols.)

5.4.2  Second  Method:  Strengthening  Weak  Guards 

The  second  method,  which  we  have  been  using  most  of  the  time,  starts  with 
the  weakest  set  of  guards  and  strengthens  them  until  the  production  rule  set 
is  ordered. 

For each assignment, the initial guard of the production rule is the wait
action that precedes it in the handshaking expansion. When the assignment, say
S, is preceded by another assignment, we introduce the net-effect of the
preceding assignment as wait action:

$x\uparrow; S$ is replaced by $x\uparrow; [x]; S$
$x\downarrow; S$ is replaced by $x\downarrow; [\neg x]; S$

For each assignment, we define two sets of states:

• the firing set, which is the set of all states in which the guard of the
assignment holds; and

• the conflicting set, which is the set of all states in which the firing of the
assignment must be disallowed. For assignment S, let S' be the complementary
assignment. The conflicting set is the set of contiguous states starting at
the state preceding S' and ending at the state preceding the assignment that
precedes S.

The “window of S” is the intersection of the firing set and the conflicting
set of S. The window set must be empty (“the window is closed”). If it is
not, we shrink the firing set of S (by strengthening the precondition) until
the intersection is empty.

Because  each  state  can  be  uniquely  characterized  in  terms  of  the  program 
variables,  it  is  always  possible  to  close  the  window  of  each  assignment  by 
strengthening  the  guards.  There  may  be  several  possible  ways  to  strengthen 
a  guard.  We  choose  the  one  that  is  the  simplest  (least  number  of  variables) 
and  that  is  best  suited  for  symmetrization  of  the  rules,  which  is  explained 
later. 
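The window computation can be sketched in Python for expansion 5.2. This is an
illustration under stated assumptions rather than the compilation procedure
itself: states are indexed by the action they precede in one cycle, a wait is
modelled as the moment the awaited value is present, and each assignment occurs
once. The names `states_before` and `window` are hypothetical.

```python
W, S = "wait", "set"
HSE_5_2 = [(W, "li", True), (S, "ro", True), (W, "ri", True), (S, "x", True),
           (W, "x", True), (S, "ro", False), (W, "ri", False), (S, "lo", True),
           (W, "li", False), (S, "x", False), (W, "x", False), (S, "lo", False)]
VARS = ["li", "ri", "ro", "lo", "x"]

def states_before(expansion, variables):
    """State preceding each action of one cycle, starting from all-false."""
    state = dict.fromkeys(variables, False)
    trace = []
    for _kind, var, value in expansion:
        trace.append(dict(state))
        state[var] = value
    return trace

def window(expansion, variables, target, guard):
    """Indices of states in both the firing set and the conflicting set."""
    n = len(expansion)
    trace = states_before(expansion, variables)
    sets = [i for i, (kind, _, _) in enumerate(expansion) if kind == "set"]
    var, value = target
    s = next(i for i in sets if expansion[i][1:] == (var, value))
    c = next(i for i in sets if expansion[i][1:] == (var, not value))
    # Assignment that precedes S in the cyclic order.
    a = max((i for i in sets if i < s), default=max(sets))
    # Contiguous (cyclic) run from the state before S' to the state before a.
    conflicting = []
    i = c
    while True:
        conflicting.append(i)
        if i == a:
            break
        i = (i + 1) % n
    return [i for i in conflicting if guard(trace[i])]

# Weak guard "not ri" for lo up leaves the window open.
print(window(HSE_5_2, VARS, ("lo", True), lambda st: not st["ri"]))
# Strengthening with x closes it.
print(window(HSE_5_2, VARS, ("lo", True), lambda st: st["x"] and not st["ri"]))
```

An empty list means the window is closed; a non-empty list gives the states in
which the guard would have to be strengthened.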

As an example of the use of the algorithm, we prove a theorem that identifies
standard production rules that need not be strengthened. This result
significantly reduces the number of cases to be considered.

Theorem 1. Production rule $xi \mapsto xo\downarrow$ of the active expansion of
communication action X and production rule $\neg xi \mapsto xo\downarrow$ of the
passive expansion of communication action X are always ordered.

Proof. The active handshaking expansion of X is

$$xo\uparrow;\ [xi];\ xo\downarrow;\ [\neg xi]$$

For $xi \mapsto xo\downarrow$, the firing set starts at the precondition of
$xo\downarrow$ and ends at the postcondition of $xo\downarrow$. The conflicting
set starts at the precondition of $xo\uparrow$ and ends at the postcondition of
$xo\uparrow$. Observe that even with reshuffling these two sets are disjoint:
the window is closed.

The passive handshaking expansion of X is:

$$[xi];\ xo\uparrow;\ [\neg xi];\ xo\downarrow$$

For $\neg xi \mapsto xo\downarrow$, the firing set starts at the precondition of
$xo\downarrow$ and ends at any place before $[xi]$. The conflicting set starts
at the precondition of $xo\uparrow$ and ends at the postcondition of
$xo\uparrow$. Again, even with reshuffling, the window is always closed. □

A similar theorem holds for standard production rules involving state
variables.

Theorem 2. For state variable u, introduced as follows in the active
handshaking expansion of X:

$$xo\uparrow;\ [xi];\ u\uparrow;\ [u];\ xo\downarrow;\ [\neg xi] \tag{5.3}$$

the production rules $xi \mapsto u\uparrow$ and $u \mapsto xo\downarrow$ are
ordered. For state variable u, introduced as follows in the passive handshaking
expansion of X:

$$[xi];\ xo\uparrow;\ [xo];\ u\uparrow;\ [u];\ [\neg xi];\ xo\downarrow \tag{5.4}$$

the production rule $xo \mapsto u\uparrow$ is ordered. The same results hold if
any of the variables involved is replaced by its complement.

The proof, which is similar to that of Theorem 1, is omitted. The results of
Theorem 2 indicate that passive handshaking is more difficult to deal with than
active handshaking.

Let us now complete the production-rule expansion of the Q-element. Since x has
been introduced to distinguish the prestate of $ro\uparrow$ from the prestate of
$lo\uparrow$, we can immediately strengthen the guard of $ro\uparrow$ with
$\neg x$ and the guard of $lo\uparrow$ with x. We get:

$$\begin{aligned}
\neg x \wedge li &\mapsto ro\uparrow &\qquad (5.5) \\
ri &\mapsto x\uparrow &\qquad (5.6) \\
x &\mapsto ro\downarrow &\qquad (5.7) \\
x \wedge \neg ri &\mapsto lo\uparrow &\qquad (5.8) \\
\neg li &\mapsto x\downarrow &\qquad (5.9) \\
\neg x &\mapsto lo\downarrow &\qquad (5.10)
\end{aligned}$$

It is easy to check in 5.2 that the strengthenings of the guards of 5.5 and 5.8
close the two windows. We further observe in 5.2 that the introduction of
$x\uparrow$ in the handshaking expansion of R, and the introduction of
$x\downarrow$ in the handshaking expansion of L both fulfil property 5.3 of
Theorem 2. Hence, according to Theorem 2, 5.6, 5.7, 5.9, and 5.10 are ordered,
and the above handshaking expansion is program ordered.
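As a sanity check, rules 5.5 to 5.10 can be run against a model of the
environment. The Python sketch below is an illustration only: the environment
rules (an active four-phase partner on L and a passive four-phase partner on R)
are an assumption made here, as are the names `Q_RULES`, `ENV_RULES`,
`effective` and `run`.

```python
# Rules 5.5 to 5.10 as (guard, variable, value).
Q_RULES = [
    (lambda s: not s["x"] and s["li"], "ro", True),    # 5.5
    (lambda s: s["ri"],                "x",  True),    # 5.6
    (lambda s: s["x"],                 "ro", False),   # 5.7
    (lambda s: s["x"] and not s["ri"], "lo", True),    # 5.8
    (lambda s: not s["li"],            "x",  False),   # 5.9
    (lambda s: not s["x"],             "lo", False),   # 5.10
]
# Assumed environment: active four-phase on L, passive four-phase on R.
ENV_RULES = [
    (lambda s: not s["lo"], "li", True),
    (lambda s: s["lo"],     "li", False),
    (lambda s: s["ro"],     "ri", True),
    (lambda s: not s["ro"], "ri", False),
]

def effective(state, rules):
    """Assignments whose guard holds and whose firing changes the state."""
    return [(var, value) for guard, var, value in rules
            if guard(state) and state[var] != value]

def run(steps):
    """Fire one transition per step; return the circuit transitions in order."""
    state = dict.fromkeys(["li", "ri", "ro", "lo", "x"], False)
    fired = []
    for _ in range(steps):
        circuit = effective(state, Q_RULES)
        assert len(circuit) <= 1, "sequential execution violated"
        pending = circuit or effective(state, ENV_RULES)
        var, value = pending[0]
        state[var] = value
        if circuit:
            fired.append((var, value))
    return fired

print(run(10))
```

Under these assumptions one cycle takes ten steps, and the circuit transitions
appear in the order in which the rules are listed.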


5.5  Operator  Reduction 

The last step of the compilation, called operator reduction, groups together
the PRs that assign the same variables. Those PRs are then identified with
(and implemented as) an operator. The program is thus identified with a set
of operators.

Since we have enforced the stability of each rule and non-interference between
any two complementary rules, we can implement any set of PRs directly.
(For reasons of efficiency, we have to see to it that the guards do not contain
too many variables in a conjunct, which would lead to too many transistors
in series. Hence, the implementation of the set may also involve decomposing
a PR into several PRs by introducing new internal variables.)

The  direct  operator  implementation  of  the  PR  set  is  straightforward: 

The PRs that set and reset ro correspond to the asymmetric C-element
$(\neg x, li)\ \mathbf{aC}\ ro$.

The PRs that set and reset lo correspond to the asymmetric C-element
$(x, \neg ri)\ \mathbf{aC}\ lo$.

The PRs that set and reset x correspond to the flip-flop
$(ri, li)\ \mathbf{ff}\ x$.

If the above operators are implemented as dynamic, this implementation of
process (L/R) is the simplest possible. If static implementations of the
operators are required, another implementation might be considered with fewer
state-holding elements since, as we have explained in the first part, static
state-holding operators are slightly more difficult to realize than
combinational operators.

A last transformation, called symmetrization, may be performed on the PR set to
minimize the number of state-holding operators. However, since symmetrization
also introduces inefficiencies of its own, it should not be applied blindly.


5.6  Symmetrization 

Symmetrization is performed on the two guards of PRs $b_1 \mapsto z\uparrow$ and
$b_2 \mapsto z\downarrow$, when one of the two guards, say, $b_1$, is already in
the form $x \wedge \neg b_2$. If we replace guard $b_2$ with $\neg x \vee b_2$,
then the two guards are complements of each other; i.e., the operator is
combinational. Of course, weakening guard $b_2$ is a dangerous transformation
since it may introduce a new state where the guard holds. We have to check that
this does not occur by checking the following invariant:

Given the new rule $\neg x \vee b_2 \mapsto z\downarrow$, $\neg z$ must hold in
any state where $\neg x \wedge \neg b_2$ holds; i.e., we have to check the
invariant truth of

$$x \vee b_2 \vee \neg z .$$

5.6.1 Operator Reduction of the (L/R)-element

The symmetrization of the PRs of the (L/R)-element gives:

$$\begin{aligned}
\neg x \wedge li &\mapsto ro\uparrow \\
ri &\mapsto x\uparrow \\
\neg li \vee x &\mapsto ro\downarrow \\
x \wedge \neg ri &\mapsto lo\uparrow \\
\neg li &\mapsto x\downarrow \\
ri \vee \neg x &\mapsto lo\downarrow
\end{aligned}$$

The PRs that set and reset ro correspond to the and-operator
$(\neg x, li) \wedge ro$. The PRs that set and reset x correspond to the
flip-flop $(ri, li)\ \mathbf{ff}\ x$. The PRs that set and reset lo correspond
to the and-operator $(x, \neg ri) \wedge lo$.

The flip-flop can be replaced with a C-element with output x.

The resulting circuit is shown in Figure 5.1. (The dot identifies the input that
is activated first.) This implementation of (L/R), either with a flip-flop or
with a C-element, is called a Q-element. The Q-element implementing (L/R)
as above is described by the infix notation $(li, lo)\ \mathbf{Q}\ (ri, ro)$.
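The invariant required by symmetrization can be checked mechanically for the
two weakened guards of the (L/R)-element. The Python sketch below is a minimal
illustration, assuming that the states to inspect are those met along one cycle
of expansion 5.2 when a wait is modelled as the moment the awaited value is
present. The names `reachable_states` and `may_symmetrize` are hypothetical;
the parameter `term` plays the role of x in the generic form
$x \wedge \neg b_2$.

```python
W, S = "wait", "set"
HSE_5_2 = [(W, "li", True), (S, "ro", True), (W, "ri", True), (S, "x", True),
           (W, "x", True), (S, "ro", False), (W, "ri", False), (S, "lo", True),
           (W, "li", False), (S, "x", False), (W, "x", False), (S, "lo", False)]
VARS = ["li", "ri", "ro", "lo", "x"]

def reachable_states(expansion, variables):
    """States met along one cycle, starting from the all-false state."""
    state = dict.fromkeys(variables, False)
    states = []
    for _kind, var, value in expansion:
        states.append(dict(state))
        state[var] = value
    return states

def may_symmetrize(states, term, b2, z):
    """Check the invariant (term or b2 or not z) in every listed state."""
    return all(term(s) or b2(s) or not s[z] for s in states)

STATES = reachable_states(HSE_5_2, VARS)

# ro: set guard (li and not x), reset guard x weakened to (not li or x).
ok_ro = may_symmetrize(STATES, lambda s: s["li"], lambda s: s["x"], "ro")
# lo: set guard (not ri and x), reset guard (not x) weakened to (ri or not x).
ok_lo = may_symmetrize(STATES, lambda s: not s["ri"], lambda s: not s["x"], "lo")
print(ok_ro, ok_lo)
```

Both checks succeed on this trace, which is consistent with the two
and-operators obtained above. Replacing the term li by ri in the first check
makes it fail, so the check does discriminate between safe and unsafe
weakenings.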


Figure 5.1: Implementation of (L/R) with a Q-element

5.7 Isochronic Forks

In the previous operator reduction, li is an input to the flip-flop
$(ri, li)\ \mathbf{ff}\ x$, and to the and-operator $(li, \neg x) \wedge ro$.
Formally, in order to compose the PRs together to form a circuit, we have to
introduce the fork $li\ \mathbf{f}\ (l1, l2)$ and replace li by l1 as input of
the and-operator, and by l2 as input of the flip-flop. We also have to introduce
the forks $ri\ \mathbf{f}\ (r1, r2)$ and $x\ \mathbf{f}\ (x1, x2)$ for the same
reason.

Let us analyse the effect of the first fork only. The PR set that includes the
PRs of the fork is:

$$\begin{aligned}
li &\mapsto l1\uparrow,\ l2\uparrow \\
\neg x \wedge l1 &\mapsto ro\uparrow \\
ri &\mapsto x\uparrow \\
\neg l1 \vee x &\mapsto ro\downarrow \\
x \wedge \neg ri &\mapsto lo\uparrow \\
\neg li &\mapsto l1\downarrow,\ l2\downarrow \\
\neg l2 &\mapsto x\downarrow \\
ri \vee \neg x &\mapsto lo\downarrow
\end{aligned}$$

Now we observe that transition $l1\uparrow$ is acknowledged by the guard of the
textually following PR but $l2\uparrow$ is not; transition $l2\downarrow$ is
acknowledged by the guard of the textually following PR but $l1\downarrow$ is
not. Hence, the assignments $l2\uparrow$ and $l1\downarrow$ do not fulfil the
completion requirement, and thus are not stable!

We  solve  this  problem  by  making  a  simplifying  assumption:  We  assume 
that  the  fork  is  isochronic,  i.e.,  the  difference  in  delays  between  the  two 
branches  of  the  fork  is  shorter  than  the  delays  in  the  operators  to  which  the 
fork  is  an  input.  Hence,  when  a  transition  on  one  output  is  acknowledged  and 
thus  completed,  the  transition  on  the  other  output  is  also  acknowledged  and 
thus  completed. 
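The role of the assumption can be seen on a single state of the forked PR set.
The Python sketch below is only an illustration: it compares the guards of the
two rules that assign x in the state reached after li, l1, ro and ri have gone
up, once with both branches of the fork switched and once under the assumed
scenario where branch l2 lags far behind l1. The function name `x_conflict` and
the two state dictionaries are hypothetical.

```python
def x_conflict(state):
    """True when both the set guard and the reset guard of x hold."""
    set_guard = state["ri"]            # ri -> x up
    reset_guard = not state["l2"]      # not l2 -> x down
    return set_guard and reset_guard

# State reached after li up, l1 up, ro up, ri up, with both branches switched.
isochronic = {"li": True, "l1": True, "l2": True, "ri": True}
# Same point of the execution, assuming branch l2 is much slower than l1.
skewed = {"li": True, "l1": True, "l2": False, "ri": True}

print(x_conflict(isochronic))
print(x_conflict(skewed))
```

In the skewed state both guards of x hold at once, which is the interference
that the isochronicity assumption rules out.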

This  is  the  only  timing  condition  that  has  to  be  fulfilled.  In  general,  the 
constraint  is  easy  to  meet  because  it  is  one-sided.  However,  the  isochronicity 
requirement  is  more  difficult  to  meet  when  a  negated  input  introduces  an 
inverter  on  a  branch  of  the  fork,  since  the  transition  delays  of  an  inverter  are 
of  the  same  order  of  magnitude  as  the  transition  delays  of  other  operators. 
We  have  proved  that,  for  the  implementation  of  each  language  construct, 
these  inverters  can  always  be  eliminated  from  the  isochronic  forks  by  simple 
transformations.  (These  transformations  have  not  been  applied  to  the  circuits 
presented  here  as  examples,  but  they  are  always  applied  before  the  circuits 
are  actually  implemented.) 

In  [14],  we  have  proved  that  the  class  of  entirely  delay-insensitive  circuits 
is  very  limited:  Practically  all  circuits  of  interest  fall  outside  the  class.  We 
believe  that  the  notion  of  isochronic  fork  is  the  weakest  compromise  to  delay- 
insensitivity  sufficient  to  implement  any  circuit  of  interest. 




Which forks have to be isochronic is easy to decide by a simple analysis of
the PR sets. For instance, the fork $ri\ \mathbf{f}\ (r1, r2)$ also has to be
isochronic, but the fork $x\ \mathbf{f}\ (x1, x2)$ does not. We shall ignore the
issue of isochronic forks in the rest of this presentation.

5.8  Reshuffled  Implementations  of  (L/R) 

We illustrate the use of reshuffling by deriving two other implementations
of (L/R). If L is an internal channel introduced for process decomposition,
we can reshuffle the handshaking expansions of L and R without the risk of
introducing deadlock. Let us return to handshaking expansion 5.1.

5.8.1  First  Reshuffling 

We postpone the second half of the handshaking expansion of R, i.e., the
sequence $ro\downarrow; [\neg ri]$, until after $[\neg li]$. We get:

$$*[[li];\ ro\uparrow;\ [ri];\ lo\uparrow;\ [\neg li];\ ro\downarrow;\ [\neg ri];\ lo\downarrow] .$$

The syntactic PR expansion we now derive is already “program ordered”:

$$\begin{aligned}
li &\mapsto ro\uparrow \\
ri &\mapsto lo\uparrow \\
\neg li &\mapsto ro\downarrow \\
\neg ri &\mapsto lo\downarrow .
\end{aligned}$$

The first and third rules specify the wire $(li\ \mathbf{w}\ ro)$, the second
and fourth rules specify the wire $(ri\ \mathbf{w}\ lo)$. Hence, the
implementation reduces to two wires!

5.8.2  Second  Reshuffling:  The  D-element 

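We now postpone the whole handshaking expansion of R until after $[\neg li]$. We
get:

$$*[[li];\ lo\uparrow;\ [\neg li];\ ro\uparrow;\ [ri];\ ro\downarrow;\ [\neg ri];\ lo\downarrow] .$$

We need to introduce a state variable, say x, as follows:

$$*[[li];\ x\uparrow;\ [x];\ lo\uparrow;\ [\neg li];\ ro\uparrow;\ [ri];\ x\downarrow;\ [\neg x];\ ro\downarrow;\ [\neg ri];\ lo\downarrow] .$$

The PR expansion gives:

$$\begin{aligned}
li &\mapsto x\uparrow \\
(ri\ \vee)\ x &\mapsto lo\uparrow \\
x \wedge \neg li &\mapsto ro\uparrow \\
ri &\mapsto x\downarrow \\
(li\ \vee)\ \neg x &\mapsto ro\downarrow \\
\neg x \wedge \neg ri &\mapsto lo\downarrow .
\end{aligned}$$

The terms between parentheses have been added for symmetrization. The operator
reduction gives:

$$\begin{aligned}
&(li, \neg ri)\ \mathbf{ff}\ x \\
&(ri, x) \vee lo \\
&(x, \neg li) \wedge ro .
\end{aligned}$$

The flip-flop can be replaced with the C-element $(li, \neg ri)\ \mathbf{C}\ x$.
The circuit is shown in Figure 5.2; it is called a D-element.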


Figure  5.2:  A  circuit  for  the  D-element 


5.9  Example  2:  A  One-place  Buffer 

The one-place buffer is the most ubiquitous process. In the processor for
example, each stage of the pipeline is a one-place buffer of the type:

$$*[L?x;\ R!f(x)] .$$

Let us ignore the transmission of messages, and implement the “bare” process:

$$*[L;\ R] .$$

One of the most useful implementations of this process is with L lazy-active
and R passive. The handshaking expansion gives:

$$*[[\neg li];\ lo\uparrow;\ [li];\ lo\downarrow;\ [ri];\ ro\uparrow;\ [\neg ri];\ ro\downarrow] .$$

We choose to include the state variable x in such a way that the transition
$x\uparrow$ is concurrent with $lo\uparrow$, and transition $x\downarrow$ is
concurrent with $ro\uparrow$. We get:

$$*[[\neg li];\ lo\uparrow;\ x\uparrow;\ [x];\ [li];\ lo\downarrow;\ [ri];\ ro\uparrow;\ x\downarrow;\ [\neg x];\ [\neg ri];\ ro\downarrow] .$$

The production rule expansion is:

$$\begin{aligned}
\neg x \wedge \neg li \wedge \neg ro &\mapsto lo\uparrow \\
lo &\mapsto x\uparrow \\
x \wedge li &\mapsto lo\downarrow \\
x \wedge \neg lo \wedge ri &\mapsto ro\uparrow \\
ro &\mapsto x\downarrow \\
\neg x \wedge \neg ri &\mapsto ro\downarrow
\end{aligned}$$

The direct implementation of this production rule set is shown in Figure 5.3.


Figure 5.3: A circuit for the one-place buffer

5.10 Boolean Register

Consider the following register process that provides read and write access to
a simple boolean variable, x:

$$\begin{aligned}
&*[[\ \overline{P} \rightarrow P?x \\
&\ \ \mid\ \overline{Q} \rightarrow Q!x \\
&]]
\end{aligned} \tag{5.11}$$

where $\neg\overline{P} \vee \neg\overline{Q}$ holds at any time.

The handshaking expansion uses the double-rail technique: The boolean value of
x is encoded on two wires, one for the value true and one for the value false.
Input channel P has two input wires, pi1 for receiving the value true, and pi2
for receiving the value false; and one output wire, po. Output channel Q has two
output wires, qo1 for sending the value true, and qo2 for sending the value
false; and one input wire, qi. Each guarded command is expanded to two guarded
commands:

$$\begin{aligned}
&*[[\ pi1 \rightarrow x\uparrow;\ [x];\ po\uparrow;\ [\neg pi1];\ po\downarrow \\
&\ \ \mid\ pi2 \rightarrow x\downarrow;\ [\neg x];\ po\uparrow;\ [\neg pi2];\ po\downarrow \\
&\ \ \mid\ x \wedge qi \rightarrow qo1\uparrow;\ [\neg qi];\ qo1\downarrow \\
&\ \ \mid\ \neg x \wedge qi \rightarrow qo2\uparrow;\ [\neg qi];\ qo2\downarrow \\
&]]
\end{aligned} \tag{5.12}$$

5.10.1  Mutual  Exclusion  Between  Guarded  Commands 

We are now faced with a new problem: enforcing mutual exclusion between the
production-rule sets of different guarded commands. (This problem is not
concerned with making the guards of the different commands mutually exclusive.
For the time being, we are considering only examples where the guards of the
commands are already mutually exclusive.) Let us illustrate our problem with the
compilation of the first two guarded commands. If we just concatenate the
production-rule sets of these two commands, we get:

$$\begin{aligned}
pi1 &\mapsto x\uparrow \\
pi1 \wedge x &\mapsto po\uparrow \\
\neg pi1 &\mapsto po\downarrow \\
pi2 &\mapsto x\downarrow \\
pi2 \wedge \neg x &\mapsto po\uparrow \\
\neg pi2 &\mapsto po\downarrow .
\end{aligned}$$

However, the second and the sixth PRs are interfering since they set and reset
variable po concurrently. For reasons of symmetry, the same holds for the third
and the fifth PRs.

The problem of ensuring mutual exclusion between PRs of different guarded
commands is the same as enforcing program order between PRs of the same guarded
command. We use the same technique, which consists in strengthening the guards
of the production rules, if necessary, by introducing state variables to
distinguish between the states corresponding to each true guard.

In the case at hand, we can strengthen the guards of the third and the sixth
rules by combining the two rules as:

$$\neg pi1 \wedge \neg pi2 \mapsto po\downarrow .$$

The  non-standard  gate  implementing  the  production  rules  of  po  is  shown  in 
Figure  5.4. 


Figure  5.4:  Non-standard  gate  for  write  acknowledge 


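We can also strengthen the guards of the third and the sixth rules as:

$$\begin{aligned}
x \wedge \neg pi1 &\mapsto po\downarrow \\
\neg x \wedge \neg pi2 &\mapsto po\downarrow .
\end{aligned}$$

Now, the PRs of po can be transformed into

$$\begin{aligned}
(pi1 \wedge x) \vee (pi2 \wedge \neg x) &\mapsto po\uparrow \\
(\neg pi1 \wedge x) \vee (\neg pi2 \wedge \neg x) &\mapsto po\downarrow ,
\end{aligned}$$

which is the definition of the if-operator $(pi1, pi2, x)\ \mathbf{if}\ po$.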
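The interference between the rules of po, and its removal by either
strengthening, can be confirmed by enumerating the eight valuations of pi1, pi2
and x. The Python sketch below is an illustration only; the function names are
introduced here and do not appear in the notes.

```python
from itertools import product

def interfering_states(set_guard, reset_guard):
    """Valuations (pi1, pi2, x) in which both guards of po hold."""
    return [(pi1, pi2, x)
            for pi1, pi2, x in product([False, True], repeat=3)
            if set_guard(pi1, pi2, x) and reset_guard(pi1, pi2, x)]

def set_po(pi1, pi2, x):
    return (pi1 and x) or (pi2 and not x)

def naive_reset(pi1, pi2, x):       # not pi1 -> po down, not pi2 -> po down
    return (not pi1) or (not pi2)

def combined_reset(pi1, pi2, x):    # not pi1 and not pi2 -> po down
    return (not pi1) and (not pi2)

def if_reset(pi1, pi2, x):          # reset guard of the if-operator
    return ((not pi1) and x) or ((not pi2) and not x)

print(interfering_states(set_po, naive_reset))
print(interfering_states(set_po, combined_reset))
print(interfering_states(set_po, if_reset))
```

Only the concatenated rule set yields interfering valuations. For the
if-operator the two guards are in fact complements of each other for every
valuation, which is what makes the operator combinational.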

The rest of the implementation is straightforward. The first and fourth PRs
correspond to the flip-flop $(pi1, \neg pi2)\ \mathbf{ff}\ x$. The
production-rule expansion of the last two guarded commands gives:

$$\begin{aligned}
x \wedge qi &\mapsto qo1\uparrow \\
\neg x \vee \neg qi &\mapsto qo1\downarrow \\
\neg x \wedge qi &\mapsto qo2\uparrow \\
x \vee \neg qi &\mapsto qo2\downarrow ,
\end{aligned}$$

which corresponds to the two operators $(x, qi) \wedge qo1$ and
$(\neg x, qi) \wedge qo2$. The circuit is represented in Figure 5.5.


Figure 5.5: Single boolean register


In the next example, we shall refer to the implementation of the first two
guarded commands as the register operator:

$$(pi1, pi2)\ \mathbf{reg}\ (po, x) .$$

We shall refer to the implementation of the last two guarded commands of 5.12
as the read operator:

$$(qi, x)\ \mathbf{read}\ (qo1, qo2) .$$


5.11  Process  Factorization 

The next example is used to introduce the technique of process factorization.
The idea is to decompose a process, say, p, described as a handshaking
expansion into a number of processes p0, p1, ..., pn such that
$(p0 \parallel p1 \parallel \ldots \parallel pn)$ is equivalent to p, i.e.,
implements the same handshaking sequence as p. Factorization obeys two rules.

• Rule 1: Each output variable belongs to exactly one factor process. (Hence
factorization reduces the number of output variables per process.) Input
variables may be shared by several factor processes.

• Rule 2: Two adjacent actions $\alpha; \beta$ of the original process are put
into two different processes during factorization if, and only if, the semicolon
between $\alpha$ and $\beta$ is superfluous. Two cases fulfill this condition:

1. the two adjacent actions $\{\neg x\}\ x\uparrow; [x]$ and the two adjacent
actions $\{x\}\ x\downarrow; [\neg x]$ for internal variable x, and

2. the pairs of handshaking actions $xo\uparrow; [xi]$ and
$xo\downarrow; [\neg xi]$ for an active implementation, and the pair of
handshaking actions $yo\uparrow; [\neg yi]$ for a passive implementation. (This
is a direct consequence of Property 1.)

5.11.1  Example:  Two-to-Four  Phase  Converter 

The following process converts a passive two-phase handshaking on channel
L into an active four-phase handshaking on channel R. First observe that
the converter cannot be specified as a buffer *[L; R]. Indeed, let (L', R') be
the channel on which the converter is to be inserted. This channel maintains
the relation cL' = cR'. The converter should leave it unchanged. But if
0 ≤ cL − cR ≤ 1, then 0 ≤ cL' − cR' ≤ 1 holds after insertion of the
converter. Hence, we have to implement the converter such that cL = cR,
i.e., we have to interleave the handshaking of L and R in such a way that L
and R are completed at the same time. We get:


conv ≡ *[[li]; ro↑; [ri]; ro↓; [¬ri]; lo↑; [¬li]; ro↑; [ri]; ro↓; [¬ri]; lo↓]

(There are several ways to interleave the handshake sequences of two
actions so as to make their completions coincide. Again, we have chosen the one
in which the waits and the assignments alternate.) We first try to factorize
conv into two processes, p1 and p2. We get

p1 ≡ *[[li]; ro↑; ...

p2 ≡ *[[ri]; ro↓; ...

Here the factorization fails since it violates rule 1. Rule 1 is violated because
actions ro↑ and ro↓ follow each other as output actions in conv. We can
separate the two output actions ro↑ and ro↓ by inserting a vacuous sequence
u↑; [u] on a newly introduced internal variable u. (Initially, u = false.) We
introduce this sequence after the first [ri]; for reasons of symmetry, we
introduce the sequence u↓; [¬u] after the second [ri]. The transformed program
is:

conv' ≡ *[[li]; ro↑; [ri]; u↑; [u]; ro↓; [¬ri]; lo↑;
          [¬li]; ro↑; [ri]; u↓; [¬u]; ro↓; [¬ri]; lo↓]
Now, we can apply factorization rule 2 without violating rule 1. We get:

p1 ≡ *[[li]; ro↑; [u]; ro↓; [¬li]; ro↑; [¬u]; ro↓]

p2 ≡ *[[ri]; u↑; [¬ri]; lo↑; [ri]; u↓; [¬ri]; lo↓]

It is easy to verify that (p1 ∥ p2) ≡ conv'. Since the sequences u↑; [u] and
u↓; [¬u] are both equivalent to a skip in conv', (p1 ∥ p2) ≡ conv.

Process p2 can immediately be identified as a standard process called
a toggle, represented by the infix operator ri tog (u, lo). For p1, we first
strengthen the guards as follows:

p1 ≡ *[[li ∧ ¬u]; ro↑; [li ∧ u]; ro↓; [¬li ∧ u]; ro↑; [¬li ∧ ¬u]; ro↓]

The validity of this transformation relies on invariants from conv'; it cannot
be justified by properties of p1 only.

Now p1 can be identified with a difference-operator: (u, li) dif ro, also
called an exclusive-or. The corresponding circuit is shown in Figure 5.6.
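The claim that the two factors together implement conv can be explored with
a small interleaving simulation. The sketch below is only an illustration
under stated assumptions: each process is modelled as a cyclic list of
"wait, then assign" steps, the two environment processes (a two-phase client
on L and a four-phase server on R) are hypothetical additions that do not
appear in the text, and the scheduler picks any enabled process at random.
The internal variable u is hidden from the recorded trace.

```python
import random

# Each step is (wait_var, wait_val, set_var, set_val):
# wait until wait_var == wait_val, then assign set_var := set_val.
P1 = [("li", 1, "ro", 1), ("u", 1, "ro", 0), ("li", 0, "ro", 1), ("u", 0, "ro", 0)]
P2 = [("ri", 1, "u", 1), ("ri", 0, "lo", 1), ("ri", 1, "u", 0), ("ri", 0, "lo", 0)]
# Hypothetical environments: two-phase active on L, four-phase passive on R.
ENV_L = [("lo", 0, "li", 1), ("lo", 1, "li", 0)]
ENV_R = [("ro", 1, "ri", 1), ("ro", 0, "ri", 0)]

# One period of conv as transitions on the external wires ("+" up, "-" down).
CONV = ["li+", "ro+", "ri+", "ro-", "ri-", "lo+",
        "li-", "ro+", "ri+", "ro-", "ri-", "lo-"]


def simulate(num_events, seed=0):
    """Run p1 || p2 with both environments; return the external trace."""
    rng = random.Random(seed)
    state = {"li": 0, "lo": 0, "ri": 0, "ro": 0, "u": 0}   # all false initially
    procs = [P1, P2, ENV_L, ENV_R]
    pc = [0, 0, 0, 0]                                      # next step of each process
    trace = []
    while len(trace) < num_events:
        enabled = [k for k, p in enumerate(procs)
                   if state[p[pc[k]][0]] == p[pc[k]][1]]
        if not enabled:
            raise RuntimeError("deadlock")
        k = rng.choice(enabled)                            # arbitrary interleaving
        _, _, var, val = procs[k][pc[k]]
        state[var] = val
        pc[k] = (pc[k] + 1) % len(procs[k])
        if var != "u":                                     # u is internal
            trace.append(var + ("+" if val else "-"))
    return trace
```

With these environments, simulate(24) can be compared against CONV * 2 for
several seeds. Agreement on a handful of runs is evidence for, not a proof
of, the equivalence stated above.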


Figure  5.6:  Two-to-four  phase  converter 


The kind of process factorization we have described in the previous section
is very helpful but can, in principle, be avoided by applying the standard
technique for production-rule expansion. One case of process factorization
that cannot be avoided is when a process has to be decomposed into two or
more processes, one of which is given. For reasons that will become clear in
the following chapters, we call this transformation “process quotient”.


5.12 Sequencing

There  are  many  ways  to  implement  the  sequencing  of  n  arbitrary  actions.  We 
shall  introduce  the  basic  operators  that  are  used  in  the  most  straightforward 
implementations. 

5.12.1 The Active-Active Buffer

Consider the program *[S1; S2], where S1 and S2 are two arbitrary program
parts. Process decomposition of this program gives

*[L; R] ∥ (L'/S1) ∥ (R'/S2).

Hence, the basic sequencing operator is the process

B(La, Ra) ≡ *[L; R] ,

where both L and R are active. This process is called an active-active buffer.
The handshaking expansion gives:

*[lo↑; [li]; lo↓; [¬li]; ro↑; [ri]; ro↓; [¬ri]] .  (5.13)

Since ri is false initially, we can rewrite 5.13 as:

*[[¬ri]; lo↑; [li]; lo↓; [¬li]; ro↑; [ri]; ro↓] .  (5.14)

By comparing 5.14 with (14), the handshaking expansion of the Q-element,
we observe that B(La, Ra) ≡ (¬ri, ro) Q (li, lo), which gives the
implementation of Figure 5.7.

5.12.2  The  (L/A;R)-element 

In order to generalize the above construction to the case of an arbitrary
number of actions, we need to implement the generalization of the
(L/R)-element. Sequence

*[S1; S2; ...; Sn]  (5.15)

can be decomposed into a number of shorter sequences by repeatedly
applying process decomposition. There are as many ways to decompose 5.15 as
there are binary trees of n leaves. But observe that, if n > 2, all
decompositions will require at least one process of the form:

(L/A; R),

where A and R are active communication actions. (The semicolon binds
more tightly than the process call.) We shall use two different reshufflings to
implement this process. Again, these reshufflings maintain the semantics of
the original program if the handshaking expansion of L' is not reshuffled.


Figure 5.7: Implementation of the active-active buffer with a Q-element


The first reshuffling is:

*[[li]; ao↑; [ai]; lo↑; [¬li]; ao↓; [¬ai]; R; lo↓] .

We decompose it into two sequences by applying the process factorization
described earlier:

(*[[li]; ao↑; [¬li]; ao↓]
∥ *[[ai]; lo↑; [¬ai]; R; lo↓]
) .

The first sequence is the wire (li w ao). The second sequence is the
D-element (ai, lo) D (ri, ro).

The second reshuffling is:

*[[li]; A; ro↑; [ri]; lo↑; [¬li]; ro↓; [¬ri]; lo↓] .

Again, we decompose it into two sequences by process factorization:

(*[[ri]; lo↑; [¬ri]; lo↓]
∥ *[[li]; A; ro↑; [¬li]; ro↓]
) .

The first sequence is the wire (ri w lo). The second sequence is the
Q-element (li, ro) Q (ai, ao). Both implementations are shown in Figure 5.8.


Figure 5.8: Implementations of the (L/A; R)-element


Now, the implementation of a sequence of n actions is straightforward. For
instance, for n = 4, we have two “linear” decompositions of (L/S1; S2; S3; S4).
The first one is

((L/S1; L1) ∥ (L1/S2; L2) ∥ (L2/S3; S4)) .

The second one is

((L/L2; S4) ∥ (L2/L1; S3) ∥ (L1/S1; S2)) .

These two decompositions lead to the linear implementations shown in
Figure 5.9.


Figure 5.9: Implementations of (L/S1; S2; S3; S4)


5.12.3  The  Passive-active  Buffer 

In order to compose one-place buffers in a linear chain, one channel must be
active and the other one passive. We implement the buffer with L passive and
R active. This version is denoted by B(Lp, Ra). In order to take advantage
of the active-active case, we decompose the buffer into two processes q and t:

q ≡ *[D'; R]
t ≡ (D/L) .

Process q is an active-active buffer. The compilation of t is straightforward.
The handshaking expansion gives:

*[[di]; [li]; lo↑; [¬li]; lo↓; do↑; [¬di]; do↓] .

Since D is an internal channel, we can reshuffle the sequence [¬li]; lo↓ with
respect to D without introducing deadlock. (Also observe that since do↓
remains the last action of the sequence, we have not changed the order of L
relative to R.) We get

*[[di]; [li]; lo↑; do↑; [¬di]; [¬li]; lo↓; do↓] .

The PR expansion leading to the circuit of Figure 6 is

di ∧ li ↦ lo↑, do↑

¬di ∧ ¬li ↦ lo↓, do↓ .

Process  t  is  used  to  connect  the  two  ports  of  a  channel  when  they  are  both 
active.  It  is  called  a  “passive-passive  adaptor”.  The  complete  circuit  is  shown 
in  Figure  5.10. 
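The two production rules of process t (both outputs rise when di and li are
both true, both fall when di and li are both false) describe a C-element
whose output is forked to lo and do. The sketch below illustrates that
reading; the function names are hypothetical and the model is purely
boolean, with no notion of delay.

```python
def c_element(a, b, prev):
    """State-holding C-element: output rises on a ∧ b, falls on ¬a ∧ ¬b."""
    if a and b:
        return True
    if not a and not b:
        return False
    return prev                      # inputs disagree: keep the old value


def adaptor_step(di, li, lo, do):
    """One evaluation of process t; lo and do are the two ends of a fork."""
    assert lo == do                  # forked output: both wires carry one value
    out = c_element(di, li, lo)
    return out, out
```

Stepping the function through di↑, li↑, di↓, li↓ reproduces the reshuffled
sequence lo↑, do↑ followed by lo↓, do↓.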




Figure  5.10:  An  implementation  of  the  passive-active  buffer 


The  passive-active  buffer  can  be  compiled  directly  by  introducing  a  state 
variable.  The  circuit  obtained  is  slightly  different.  See  [9]. 
Chapter 6

Case  Study:  Two 
Arbitration  Problems 

6.1  Introduction 

In this chapter, we construct circuits for two difficult control problems
involving arbitration among asynchronous events. These examples show how
to introduce the two standard building blocks for arbitration circuitry, the
arbiter and the synchronizer.

The first example addresses the issues of arbitration between guards and
unstable guards. We have already discussed the metastability property of
arbiters. But the realization of a delay-insensitive arbiter raises another issue:
fairness. An arbiter is strongly fair when a pending communication request
is granted after a bounded number of other requests are granted. An arbiter
is weakly fair when a request is granted after a finite but possibly
unbounded number of other requests. Whether it is possible to construct a
delay-insensitive fair arbiter has been, so far, an open question. It has been
conjectured that delay-insensitive fair arbiters do not exist. In this example,
we prove the existence of delay-insensitive fair arbiters by constructing one.


6.1.1 A Fair-Arbiter Program

The process fsel described in the first part defines a fair arbitration program
between two unrelated inputs. We choose to implement the following
simplified version of fsel:

*[[A̅ → A | ¬A̅ → skip]; [B̅ → B | ¬B̅ → skip]] .  (6.1)

According to 6.1, when A̅ holds, A will be completed after, at most, one B
action, whatever the current state of the computation is. Hence, the arbiter is
strongly fair towards requests A and B. Assume that A' is pending at a certain
point of the computation. By definition of the probe, A̅ is true eventually;
i.e., a finite but unbounded number of B actions can be completed between
the moment qA' holds and the moment A̅ holds. Hence, the arbiter is only
weakly fair towards requests A' and B'.

Therefore, with this definition of suspension of an action, we can say that
the arbiter is strongly fair towards requests that have reached the arbiter
and weakly fair towards all requests. (We could redefine the suspension of a
communication action X such that qX holds only when the initiation of action
X can be observed by the other process. With this definition of suspension,
we have qA' = A̅. The arbiter is then strongly fair towards all requests.)
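The bound "at most one B action" can be illustrated with a toy model of
program 6.1. The sketch below makes several simplifying assumptions that are
not in the text: time is a hypothetical poll counter, even polls evaluate the
first selection and odd polls the second, the probe of A stays true from its
arrival poll onwards, and B is requesting at every poll (the worst case for
A). The function name is hypothetical.

```python
def b_grants_before_a(a_arrival, polls=40):
    """Count B actions completed between the first poll at which the probe
    of A holds and the completion of A, with B requesting continuously."""
    count = 0
    for t in range(polls):
        probe_a = t >= a_arrival      # the probe stays true once it holds
        if t % 2 == 0:                # first selection of (6.1)
            if probe_a:
                return count          # A is selected and completed
        elif probe_a:                 # second selection, B always pending
            count += 1                # B completes while A is waiting
    raise RuntimeError("A was never served within the simulated polls")
```

Whatever the arrival poll, the count is 0 or 1, which is the strong-fairness
bound for requests that have reached the arbiter. The model says nothing
about the unbounded delay between the request A' and the probe becoming true.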


6.1.2  The  Compilation 

Applying the process decomposition rule, we decompose 6.1 into three
processes (P1 ∥ P2 ∥ P3). Channels (C, D) between P1 and P2, and (E, F)
between P1 and P3 are introduced.

P1 ≡ *[E; C]

P2 ≡ *[[D̅ ∧ B̅ → B; D
      | D̅ ∧ ¬B̅ → D
      ]]

P3 ≡ *[[F̅ ∧ A̅ → A; F
      | F̅ ∧ ¬A̅ → F
      ]] .

Ports D and F are implemented as passive; ports C and E are implemented
as active. Hence P1 is the standard active-active buffer. The handshaking
expansion of P2 gives:

P2 ≡ *[[di ∧ bi → bo↑; [¬bi]; bo↓; do↑; [¬di]; do↓
      | di ∧ ¬bi → do↑; [¬di]; do↓
      ]] .


Because bi can change from false to true asynchronously, the second guard
of P2 is not stable; i.e., its value can change from true to false at any time.
In order to make both guards of P2 stable, we introduce the synchronizer

sync ≡ *[[di ∧ bi → u↑; [¬di]; u↓
        | di ∧ ¬bi → v↑; [¬di]; v↓
        ]] .

sync is a standard operator that we have described in Part I. We now have
to find a process, X, such that (X ∥ sync) ≡ P2. Since sync is entirely
defined, we would like to be able to perform the inverse operation of ∥, or
“process quotient”, so as to compute X as X ≡ (P2 ÷ sync). A way to perform
this quotient is to remove all actions of sync from P2, and then to check
whether the result fulfills (X ∥ sync) ≡ P2.

To perform the quotient as suggested, P2 should be extended to contain
all actions of sync, so that the orders of actions are compatible in sync and
in the extended version of P2. (This procedure is explained in [10].) The
extension of P2 gives:

*[[di ∧ bi → u↑; [u]; bo↑; [¬bi]; bo↓; do↑; [¬di]; u↓; [¬u]; do↓
 | di ∧ ¬bi → v↑; [v]; do↑; [¬di]; v↓; [¬v]; do↓
 ]] .

We obtain for X:

*[[u → bo↑; [¬bi]; bo↓; do↑; [¬u]; do↓
 | v → do↑; [¬v]; do↓
 ]] .


The compilation of the first guarded command is facilitated if transition bo↓ is
postponed until after [¬u]. This transformation does not introduce deadlock
since the completion of D does not depend on the completion of B. After this
transformation, the PR expansion gives:

u ↦ bo↑
u ∧ ¬bi ↦ do↑
bi ∨ ¬u ↦ do↓
¬u ↦ bo↓

v ↦ do↑
¬v ↦ do↓ .


The operator reduction, which includes introducing auxiliary variables do'
and do'', gives

u w bo
(u, ¬bi) ∧ do'
v w do''
(do', do'') ∨ do .


The  circuit  is  shown  in  Figure  6.1.  The  implementation  of  P3  is  identical. 
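The operator reduction for X (a wire from u to bo, an and-gate producing do'
from u and ¬bi, a wire from v to do'', and an or-gate producing do) can be
written down as a combinational function and stepped through the two guarded
commands. This is a sketch under the assumption that the operator list is
read as just described; the function name is hypothetical and gate delays
are ignored.

```python
def x_operators(u, v, bi):
    """Evaluate the four operators of X and return (bo, do)."""
    bo = u                     # wire      u w bo
    do1 = u and not bi         # and-gate  (u, ¬bi) ∧ do'
    do2 = v                    # wire      v w do''
    do = do1 or do2            # or-gate   (do', do'') ∨ do
    return bo, do
```

For the first guarded command, u rises while bi is still true (bo rises, do
stays low), then bi falls (do rises), then u falls (both outputs fall). For
the second one, do simply follows v.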


6.1.3  The  Circuit 

The final circuit, shown in Figure 6.2, is obtained by composing the two
identical circuits implementing P2 and P3 with the circuit of P1. The reshuffled
version of P1, consisting of a wire and an inverter, can also be used if it can be
proved that the reshuffling does not introduce deadlock. The circuit shown in
Figure 6.2 includes a minor optimization that eliminates the negated inputs
that are also the output of a fork.


Figure 6.1: Implementation of P2


Notice  that  the  solution  can  be  immediately  generalized  to  an  arbitrary 
number  of  requests. 

6.2  Distributed  Mutual  Exclusion 

The first paper describing this method for the synthesis of asynchronous
circuits from high-level description was presented at the 1985 Chapel Hill
Conference on VLSI [8]. The example used to illustrate the method was the
algorithm for distributed mutual exclusion on a ring of processes described in
Chapter 2.

Unfortunately, the circuit presented in the Chapel Hill paper is not entirely
correct: A glitch may appear on the wire named z in the paper. The error
is due to my not following the compilation procedure when I defined the
variable z. The error was noticed by many people, and the actual CMOS
implementation of the circuit realized by Andy Fife the same year is entirely
correct.


Figure 6.2: Implementation of the fair arbiter


However, I never took the time to publish the correct solution, and
therefore the bug has been rediscovered over and over again, sometimes with great
publicity [4]. Since several people have asked me to show them a correct
derivation of the circuit, here it is after five years!

As  in  the  original  paper,  we  observe  that  the  two  consecutive  D  commands, 
and  the  two  consecutive  U  commands  can  both  be  implemented  as  the  two 
halves  of  a  4-phase  handshaking  protocol;  and  therefore  we  can  replace  the  two 
U  commands  with  one  single  U  to  be  implemented  as  a  4-phase  handshaking 
protocol. 

Next, we decompose process m into two processes A and B as follows:

A ≡ *[[U̅ → P; Bt; U
     | L̅ → P; Bf; L
     ]]

B ≡ *[[Q̅ ∧ b → Q
     | Q̅ ∧ ¬b → R; Q
     | S̅ → b↑; S
     | T̅ → b↓; T
     ]]

The internal channels between A and B are (P, Q), (Bt, S), and (Bf, T).
The technique used to obtain A and B is the standard process
decomposition, with one addition. The plain process decomposition would give a
process A with U before Bt, and L before Bf. We have inverted the order of
these actions, since it is semantically irrelevant whether the assignment to b is
the last action of the guarded command provided the assignment follows the
selection command. The reason for this transformation is that the program
in which U and L are the last actions of the guarded commands is easier to
implement. This point will be further explained in the compilation of A.

6.2.1  Compilation  of  A 

Since the guards U̅ and L̅ are not mutually exclusive, we are introducing an
arbiter described by the program:

Arb ≡ *[[ui → u'↑; [¬ui]; u'↓
       | li → l'↑; [¬li]; l'↓
       ]] .

We know that A ≡ (Arb ∥ A'), where A' ≡ (A ÷ Arb).

Exercise: Prove the correctness of the above result. □

Hence:

A' ≡ *[[u' → po↑; [pi]; po↓; [¬pi]; bto↑; [bti]; bto↓; [¬bti]; uo↑; [¬u']; uo↓
      | l' → po↑; [pi]; po↓; [¬pi]; bfo↑; [bfi]; bfo↓; [¬bfi]; lo↑; [¬l']; lo↓
      ]]

6.2.2  Mutual  exclusion  among  guarded  commands 

The main problem in implementing A' is to enforce the mutual exclusion
between the two guarded commands (GCs). By construction of the arbiter
circuit Arb, we know that, provided that ¬u' ∧ ¬l' holds initially, ¬u' ∨ ¬l'
holds at any time. Hence, the mutual exclusion between the guards of A is
guaranteed.

However, as soon as u'↑ is completed, the first GC of the arbiter can
complete, the second GC of the arbiter can start, and consequently, the PR
set implementing the second GC of A' can start firing, even though the first
GC of A' may not be completed. We shall see that, in order to enforce the
mutual exclusion between the implementations of the two GCs of A', it is
advantageous to postpone u'↓ as long as possible. This explains our decision
to modify A such that U is the last action of the first GC, and L the last
action of the second GC.


6.3  First  Solution 

We slightly reshuffle the actions of A' as follows:

A' ≡ *[[u' → po↑; [pi]; po↓; [¬pi]; bto↑; [bti]; uo↑; [¬u']; bto↓; [¬bti]; uo↓
      | l' → po↑; [pi]; po↓; [¬pi]; bfo↑; [bfi]; lo↑; [¬l']; bfo↓; [¬bfi]; lo↓
      ]]

We first ignore the transitions on bto, bti, bfo, and bfi, and implement the
program:

A' ≡ *[[u' → po↑; [pi]; po↓; [¬pi]; uo↑; [¬u']; uo↓
      | l' → po↑; [pi]; po↓; [¬pi]; lo↑; [¬l']; lo↓
      ]]

Each guarded command is a Q-element. The transitions on bto, bti, bfo, and
bfi are added by just “opening” the wires uo and lo, respectively.

For mutual exclusion between the implementations of the two guarded
commands, the guard u' is strengthened as u' ∧ ¬lo, and the guard l' is
strengthened as l' ∧ ¬uo.

6.3.1  Merge 

We now have to compose the circuit implementing the first GC with the
one implementing the second GC. This composition is a little more than
mere juxtaposition because the two circuits use the variables pi and po. The
standard way to deal with this case is to compose the two circuits with a
merge circuit.

We replace P with P1 in the first GC, and with P2 in the second GC, and
add the merge process:

*[[P̅1 → P1 • P
 | P̅2 → P2 • P
 ]]

The handshaking expansion gives:

*[[p1i → po↑; [pi]; p1o↑; [¬p1i]; po↓; [¬pi]; p1o↓
 | p2i → po↑; [pi]; p2o↑; [¬p2i]; po↓; [¬pi]; p2o↓
 ]]

The production rule expansion gives:

p1i ∨ p2i ↦ po↑
pi ∧ p1i ↦ p1o↑
¬p1i ∧ ¬p2i ↦ po↓
¬pi ↦ p1o↓
pi ∧ p2i ↦ p2o↑
¬pi ↦ p2o↓ .

The operators are the or-gate (p1i, p2i) ∨ po, and the two asymmetric
C-elements (pi; p1i) aC p1o and (pi; p2i) aC p2o.
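To make the role of the asymmetric C-elements concrete, the merge operators
can be stepped through one handshake of the first client. The sketch below
assumes the reading of the production rules in which an asymmetric C-element
(a; b) aC c sets c when a ∧ b holds and resets it when ¬a holds; the function
and dictionary names are hypothetical, and all gates are evaluated at once
with no delays.

```python
def asym_c(a, b, prev):
    """Asymmetric C-element (a; b) aC c: a ∧ b sets c, ¬a resets c."""
    if a and b:
        return True
    if not a:
        return False
    return prev                          # a true, b false: hold the value


def merge_step(s):
    """Evaluate the or-gate and both asymmetric C-elements once."""
    s = dict(s)
    s["po"] = s["p1i"] or s["p2i"]                     # (p1i, p2i) ∨ po
    s["p1o"] = asym_c(s["pi"], s["p1i"], s["p1o"])     # (pi; p1i) aC p1o
    s["p2o"] = asym_c(s["pi"], s["p2i"], s["p2o"])     # (pi; p2i) aC p2o
    return s


def first_client_handshake():
    """Drive p1i↑, pi↑, p1i↓, pi↓ and record (po, p1o, p2o) after each."""
    s = {"p1i": False, "p2i": False, "pi": False,
         "po": False, "p1o": False, "p2o": False}
    seen = []
    for var, val in [("p1i", True), ("pi", True), ("p1i", False), ("pi", False)]:
        s[var] = val
        s = merge_step(s)
        seen.append((s["po"], s["p1o"], s["p2o"]))
    return seen
```

The recorded states follow the first guarded command of the handshaking
expansion: po rises, then p1o rises, then po falls while p1o is held, and
finally p1o falls. The output p2o stays false throughout.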

6.3.2  Circuit  for  A’ 

Composing the merge circuit and the circuits for the two guarded commands
leads to an implementation of A'. But we make two observations. First, the
asymmetric C-elements in the merge are not needed in this case. Second, and
more importantly, we realize that instead of merging the two circuits after the
two Q-elements, we could merge them before the Q-elements so that the two
circuits could share the same Q-element. This transformation is formalized
by the following program decomposition. We have A' ≡ (A1 ∥ Q), with:

A1 ≡ *[[u' ∧ ¬lo → po'↑; [pi']; bto↑; [bti]; uo↑; [¬u']; po'↓; [¬pi']; bto↓; [¬bti]; uo↓
      | l' ∧ ¬uo → po''↑; [pi']; bfo↑; [bfi]; lo↑; [¬l']; po''↓; [¬pi']; bfo↓; [¬bfi]; lo↓
      ]]

Q ≡ *[[po' ∨ po'']; po↑; [pi]; po↓; [¬pi]; pi'↑; [¬po' ∧ ¬po'']; pi'↓]

The first guarded command of A1 is compiled as:

u' ∧ ¬lo ↦ po'↑
pi' ∧ po' ↦ bto↑
bti ↦ uo↑
lo ∨ ¬u' ↦ po'↓
¬pi' ↦ bto↓
¬bti ↦ uo↓

The operator reduction gives:

(u', ¬lo) ∧ po'
(pi'; po') aC bto
bti w uo

The compilation of the second GC of A1 is similar.

6.3.3  Compilation  of  B 

The compilation of B is identical to that of the original paper. The
handshaking expansion of B with a slight reshuffling of the actions in the second
GC gives:

B ≡ *[[qi ∧ b → qo↑; [¬qi]; qo↓
     | qi ∧ ¬b → ro↑; [ri]; qo↑; [¬qi]; ro↓; [¬ri]; qo↓
     | si → b↑; so↑; [¬si]; so↓
     | ti → b↓; to↑; [¬ti]; to↓
     ]]

We first observe that the mutual exclusion between the guards and between
the guarded commands is guaranteed. The production rule expansion gives:

qi ∧ b ↦ qo↑
b ∧ ¬qi ↦ qo↓
qi ∧ ¬b ↦ ro↑
ri ↦ qo↑
¬qi ↦ ro↓
¬b ∧ ¬ri ↦ qo↓

The conjunct b is added to the guard of the first PR for mutual exclusion
with the second GC. A better strengthening of the two rules that reset qo is

¬qi ∧ ¬ri ↦ qo↓

Combining all PRs relative to qo gives:

(qi ∧ b) ∨ ri ↦ qo↑
¬qi ∧ ¬ri ↦ qo↓

The other operator is (qi; ¬b) aC ro. The production rule expansion of the
last two GCs is straightforward. It gives:

si ↦ b↑
ti ↦ b↓
si ∧ b ↦ so↑
¬si ↦ so↓
ti ∧ ¬b ↦ to↑
¬ti ↦ to↓

The set of operators is:

(si; ¬ti) ff b
(si; b) aC so
(ti; ¬b) aC to

The last two operators can be replaced with the and-gates (si, b) ∧ so and
(ti, ¬b) ∧ to by the usual symmetrization. The flip-flop can be replaced with
the C-element (si, ¬ti) C b, also by symmetrization.
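The part of B that stores b can be sketched as a set/reset flip-flop followed
by the two and-gates obtained by symmetrization. This is a boolean
illustration only: the function names are hypothetical, the flip-flop is
modelled as "si sets b, ti resets b", and the model relies on si and ti being
mutually exclusive, as the text states for the guards of B.

```python
def flip_flop(set_in, reset_in, prev):
    """Set/reset flip-flop for b: si ↦ b↑, ti ↦ b↓, otherwise hold."""
    assert not (set_in and reset_in)     # guards of B are mutually exclusive
    if set_in:
        return True
    if reset_in:
        return False
    return prev


def b_register_step(si, ti, b):
    """Return the new (b, so, to) after one evaluation of the operators."""
    b = flip_flop(si, ti, b)
    so = si and b                        # and-gate (si, b) ∧ so
    to = ti and not b                    # and-gate (ti, ¬b) ∧ to
    return b, so, to
```

A request on S sets b and is acknowledged on so only once b is true; a
request on T resets b and is acknowledged on to only once b is false.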

The  complete  circuit  is  shown  in  Figure  6.3. 


6.4 Exercise: Implementation without reshuffling

Can we implement the program of A directly without postponing U and L?
We have to implement the following version of A:

A ≡ *[[U̅ → P; U; Bt
     | L̅ → P; L; Bf
     ]]

B is unchanged.

A ≡ (Arb ∥ A'),

where Arb is unchanged. A' is slightly reshuffled.

A' ≡ *[[u' → po↑; [pi]; uo↑; [¬u']; po↓; [¬pi]; uo↓; bto↑; [bti]; bto↓; [¬bti]
      | l' → po↑; [pi]; lo↑; [¬l']; po↓; [¬pi]; lo↓; bfo↑; [bfi]; bfo↓; [¬bfi]
      ]]

Apart from the opening of the uo wire for the (po, pi) connection, the first
guarded command is just the passive-active buffer:

*[[u']; uo↑; [¬u']; uo↓; bto↑; [bti]; bto↓; [¬bti]]

The rest of the compilation is left as an exercise to the reader.


Figure 6.3: Circuit for a server


Chapter 7


Implementation  of  the 
Lazy  Stack 

7.1  Introduction 

The design of the stack will be used to explain the general method for
implementing communications that involve passing messages. The method relies
on the time-honored “divide-and-conquer” principle: We first construct the
so-called control part of the program, which is the original program in which
messages have been removed from each communication action, and all
arithmetic operations have been replaced by procedure and function calls. We
then combine this control part with a data path, which is a collection of
processes implementing the assignment parts of the communication actions and
the functions and procedures implementing arithmetic operations.


7.2 The Control Part of the Stack

We assume that the stack is empty initially. We introduce the channel (t, t'),
so that F can be called from within E by process decomposition. We get

E ≡ *[[i̅n̅ → in?x; t
     | o̅u̅t̅ → get?x; out!x
     ]]

F ≡ *[[t̅' ∧ i̅n̅ → put!x; in?x
     | t̅' ∧ o̅u̅t̅ → out!x; t'
     ]] .




The control part of the stack consists of programs E and F, from which
message communication has been removed. We get

E ≡ *[[i̅n̅ → in; t
     | o̅u̅t̅ → get; out
     ]]

F ≡ *[[t̅' ∧ i̅n̅ → put; in
     | t̅' ∧ o̅u̅t̅ → out; t'
     ]] .

In the handshaking expansion, we let the choice of active and passive
communications be dictated by the occurrence of the probes. (However, we will
return to this choice later.) We get

E ≡ *[[ini → ino↑; [¬ini]; ino↓; to↑; [ti]; to↓; [¬ti]
     | outi → geto↑; [geti]; geto↓; [¬geti]; outo↑; [¬outi]; outo↓
     ]]

F ≡ *[[ti' ∧ ini → puto↑; [puti]; puto↓; [¬puti]; ino↑; [¬ini]; ino↓
     | ti' ∧ outi → outo↑; [¬outi]; outo↓; to'↑; [¬ti']; to'↓
     ]] .

7.2.1  Compilation  of  E 

The first guarded command, E1, is a standard passive-active buffer. The
second guarded command, E2, is a standard Q-element. The implementation
of E must combine the implementations of E1 and E2 in a way that enforces
mutual exclusion between the execution of E1 and that of E2.

Since the execution of in and that of out are mutually exclusive, it suffices
to guarantee that when in is completed in E1, E2 cannot start until t is
completed. On the other hand, we are sure that E1 cannot start before E2
is completed because outo↓ is the last action of E2.

In order to prevent E2 from starting before E1 is completed, we have to
introduce an extra variable, or reshuffle the handshaking expansions. We choose
to introduce the variable z (initially true) in the handshaking expansion of
E1, and we strengthen the guard of E2 with z. We get


E1 ≡ z ∧ ini → ino↑; z↓; [¬z]; ino↓; to↑; [ti]; to↓; z↑ ,


E2 ≡ ¬ti ∧ outi ∧ z → geto↑; [geti]; geto↓; [¬geti]; outo↑; [¬outi]; outo↓ .

It turns out that our choice for variable z is quite fortunate as it is already
an internal variable of E1, as indicated in Figure 7.1.


Figure 7.1: Implementation of the first g.c. of E with variable z


Now, E2 cannot start until z↑ is completed, i.e., until E1 is completed. For
symmetrization, we also weaken ¬outi as ¬outi ∨ ¬z. Hence, mutual exclusion
is enforced by replacing input outi with the and-operator (outi, z) ∧ outi' in
the Q-element implementation of E2. This gives the circuit of Figure 7.2 as
an implementation of E.

7.2.2  Compilation  of  F 

The compilation of the first guarded command F1 of F is identical to that of
E2 with the appropriate change of variables. The compilation of the second
guarded command F2, however, can be simplified by reshuffling. We reshuffle
the handshaking sequence of t' in F2 as follows:

ti' ∧ outi → outo↑; to'↑; [¬ti' ∧ ¬outi]; outo↓; to'↓

The validity of this reshuffling stems from the fact that we do not reshuffle the
initiation or the completion of action t' since [ti'] and to'↓ are not reshuffled
and the reshuffling of the middle two actions of t' does not introduce
deadlock. The above sequence compiles immediately into the “forked” C-element
(ti', outi) C (outo, to'). The reshuffling guarantees that F1 cannot be started
before F2 is completed.


Figure 7.2: Implementation of E


The channels in and out are used both in E and F, so we need to merge
the local copies of in and the local copies of out in a standard way that we
do not describe here. The resulting circuit for the control part of the stack
element is shown in Figure 7.3.


7.3  Implementation  of  the  data  path 

We now have to extend the implementation of the control part S2 so as
to obtain an implementation of the whole program S1. We want to leave
S2 unchanged by introducing a data path process, P, such that the parallel
composition of S2 and P implements S1.

The channels in, out, get, put of S2 are renamed in', out', get', put'. P
communicates with S2 via in', out', get', put' and with the environment via
in, out, get, put. (See Figure 7.4.)

Let C be a channel of S1, and C' be the renamed channel of S2 to which C
corresponds. For (S2 ∥ P) to implement S1, each communication on C must
coincide with a communication on C'; i.e., P must implement the so-called
channel interface process

Ic ≡ *[C • C'] .

Hence, P has to implement the four channel interfaces:

*[in' • in?x]

*[out' • out!x]

*[get' • get?x]

*[put' • put!x] .

7.4  Implementation  of  Channel  Interfaces 

There  are  four  types  of  channel  interfaces,  depending  on  whether  the  port  is 
active  or  passive,  and  whether  the  communication  is  an  input  or  an  output. 

7.4.1  Input  Actions  on  a  Passive  Port 

We want to implement the interface Ic for action C?x on the passive port C.
Ic communicates with S2 by the active port C', and with the environment
by the passive port D. Furthermore, in the standard double-rail encoding
technique, the two-wire implementation (ci, co) of C has to be interfaced to
the three-wire input port D in which the two input wires, di1 and di2, are
used to encode the two values of the incoming message. (See Figure 7.5.)

Ic has to implement an interleaving of the three sequences:

Sc ≡ *[ci'↑; [co']; ci'↓; [¬co']]

Sd ≡ *[[di1 ∨ di2]; do↑; [¬di1 ∧ ¬di2]; do↓]

Sx ≡ *[[di1 → x↑; [x] [] di2 → x↓; [¬x]]] .

We first interleave sequences Sc and Sd so as to implement C' • D:

*[[di1 ∨ di2]; ci'↑; [co']; do↑; [¬di1 ∧ ¬di2]; ci'↓; [¬co']; do↓] .   (7.1)

Next we interleave (7.1) and Sx. The interleaving has to ensure that the
assignment to x is inserted after [co'] so that, when the assignment to x is
performed in the datapath, communication action C has indeed been started
in the control part. This interleaving is the final specification of the interface
Ic:

*[[di1 ∨ di2]; ci'↑; [co' ∧ di1 → x↑; [x] [] co' ∧ di2 → x↓; [¬x]];
  do↑; [¬di1 ∧ ¬di2]; ci'↓; [¬co']; do↓] .   (7.2)
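
The claim that (7.2) is an interleaving of Sc, Sd, and Sx can be checked on
one cycle by projecting the event sequence onto the wires of each component.
The following Python sketch does this for an incoming "true" message (di1).
It is only an illustration under stated assumptions: each wait is represented
by the environment transition that satisfies it, "+" and "-" stand for ↑ and
↓, and the names CYCLE_72, project, and is_interleaving are hypothetical.

```python
# One cycle of (7.2) for an incoming "true" message on di1.
# "+" is an up-going transition, "-" a down-going transition.
CYCLE_72 = ["di1+", "ci'+", "co'+", "x+", "do+", "di1-", "ci'-", "co'-", "do-"]

# Per-component event orders, read off Sc, Sd, and Sx.
SC = ["ci'+", "co'+", "ci'-", "co'-"]
SD = ["di1+", "do+", "di1-", "do-"]
SX = ["di1+", "x+"]


def project(trace, events):
    """Keep only the events of `trace` that belong to `events`."""
    wanted = set(events)
    return [e for e in trace if e in wanted]


def is_interleaving(trace, components):
    """True if the order of every component is preserved in the trace."""
    return all(project(trace, c) == c for c in components)
```

The additional requirement stated in the text, that the assignment to x
follows [co'], corresponds to "x+" appearing after "co'+" in the cycle.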
We can implement (7.2) directly as follows:

di1 ∨ di2          ↦ ci'↑
co' ∧ di1          ↦ x↑
co' ∧ di2          ↦ x↓
di1 ∧ x ∨ di2 ∧ ¬x ↦ do↑
¬di1 ∧ ¬di2        ↦ ci'↓
¬co'               ↦ do↓

We can also decompose (7.2) into standard operators. We first decompose
(7.2) into the two sequences:

*[[di1 ∨ di2]; ci'↑; [¬di1 ∧ ¬di2]; ci'↓]   (7.3)

and

*[[co' ∧ di1 → x↑; [x]; do↑; [¬co']; do↓
 [] co' ∧ di2 → x↓; [¬x]; do↑; [¬co']; do↓
 ]] .   (7.4)

Sequence (7.3) is realized by the operator (di1, di2) ∨ ci'. We factor (7.4)
so as to isolate the register part:

(co', di1) aC x1 ≡ *[[co' ∧ di1]; x1↑; [¬co']; x1↓]

(co', di2) aC x2 ≡ *[[co' ∧ di2]; x2↑; [¬co']; x2↓]

(x1, x2) reg (x, do) ≡ *[[x1 → x↑; [x]; do↑; [¬x1]; do↓
                        [] x2 → x↓; [¬x]; do↑; [¬x2]; do↓
                        ]] .

The  implementation  is  shown  in  Figure  7.6. 

7.4.2  Input  Actions  on  an  Active  Port 

For port C active, the communication variables of the interface Ic remain the
same. But now the handshaking expansions of C' and D are different, since
C' is passive and D is active. We get:

Sc ≡ *[[co']; ci'↑; [¬co']; ci'↓]

Sd ≡ *[do↑; [di1 ∨ di2]; do↓; [¬di1 ∧ ¬di2]]

Sx ≡ *[[di1 → x↑; [x] [] di2 → x↓; [¬x]]] .

(Observe that Sx is not changed.)

An interleaving of Sc and Sd that implements C' • D is the interleaving
corresponding to two wires:

*[[co']; do↑; [di1 ∨ di2]; ci'↑; [¬co']; do↓; [¬di1 ∧ ¬di2]; ci'↓] .

As to the implementation of the assignment to x, we now observe that,
since C and D are active, there is no risk that the assignment to x be started
before C is. The interleaving obtained is:

*[[co']; do↑; [di1 → x↑ [] di2 → x↓];
  ci'↑; [¬co']; do↓; [¬di1 ∧ ¬di2]; ci'↓] ,

which can be factored into the wire

(co' w do) ≡ *[[co']; do↑; [¬co']; do↓]

and the register

(di1, di2) reg (x, ci') ≡ *[[di1 → x↑; [x]; ci'↑; [¬di1]; ci'↓
                           [] di2 → x↓; [¬x]; ci'↑; [¬di2]; ci'↓
                           ]] .

The  implementation  of  the  interface  is  shown  in  Figure  7.7. 


7.5  Output  Actions 

In the case of an output, like out!x or put!x, the implementation turns out
to be the same for passive and active ports. Given the same nomenclature as
in the input case, port D is now implemented with two output variables, do1
and do2, and one input variable di. Port C' is not changed. The rest of the
derivation is straightforward and is left as an exercise for the reader. It leads
to a wire and a read operator, which we have introduced in the implementation
of the register.

di w ci' ≡ *[[di]; ci'↑; [¬di]; ci'↓]

(co', x) read (do1, do2) ≡ *[[x ∧ co' → do1↑; [¬co']; do1↓
                            [] ¬x ∧ co' → do2↑; [¬co']; do2↓
                            ]] .

The  only  difference  between  the  active  and  the  passive  cases  is  that,  in  the 
active  case,  the  read  is  activated  first.  In  the  passive  case,  the  wire  is  activated 
first.  The  circuit  is  shown  in  Figure  7.8. 
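
Because x is stable while co' is high, the read operator can be viewed, for
illustration, as a pair of and-gates. The Python sketch below models only this
steady-state view; the function name read_operator is an assumption of the
sketch, and transient behavior is not represented.

```python
def read_operator(co, x):
    """Return the steady-state outputs (do1, do2) for inputs (co', x)."""
    do1 = co and x        # x ∧ co' raises do1; ¬co' lowers it
    do2 = co and not x    # ¬x ∧ co' raises do2; ¬co' lowers it
    return do1, do2
```

With co' low both rails are low (the zero value); with co' high exactly one
rail is high and identifies the value of x.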

7.5.1  Active  Input  and  Passive  Output 

A somewhat surprising result of this implementation of input and output
commands is that, contrary to common belief, it is simpler to implement input
commands with active ports than with passive ports. The gain is quite
important: For n bits of data, the active implementation saves 2 × n asymmetric
C-elements and n or-gates. On the other hand, the implementation of output
actions is the same for active and passive ports.

Therefore, we shall always implement input actions with active ports.
When the input port is probed, like port in in the stack example, we shall use
a slightly more complicated handshaking protocol that makes it possible to
probe an active port. A simple version of this protocol consists of replacing
the single passive communication, say in, with two communications in1
and in2, with in1 passive and probed, and in2 active and used for the input
action. The two handshaking expansions are usually interleaved as follows:

ini → ... ino↑; [¬ini]; ino↓

is replaced with

in1i → ... in1o↑; [in2i]; in2o↑; [¬in1i]; in1o↓; [¬in2i]; in2o↓ .

(In the implementation of the microprocessor, we have used a more efficient
version of this protocol.)

7.6  The  Complete  Circuit  for  the  Stack 

The  sharing  of  register  x  by  ports  in  and  get  has  to  be  implemented  either 
by  a  multiplexer  or  by  a  multiport  flip-flop.  Since  only  two  ports  share  the 
register,  we  choose  to  use  a  dual-port  flip-flop.  The  complete  data  path  is 
shown  in  Figure  7.9. 

The complete circuit obtained by composing the different parts together
is shown in Figure 7.10. An important optimization has been added to the
design. It concerns the implementation of the second guard of E:

out → get?x; out!x.

We observe that the value of x involved in the second action (out!x) is the
same as the value of x involved in the first action (get?x). We can therefore
replace it with

out → out!(get?).

The handshaking expansion is:

outi → geto↑; [geti1 → outo1↑ [] geti2 → outo2↑]; [¬outi];
geto↓; [¬geti1 → outo1↓ [] ¬geti2 → outo2↓]

The implementation is the three wires outi w geto, geti1 w outo1, and
geti2 w outo2.

The above modification leads to a significant simplification of the circuit
since we can eliminate a D-element, and, for each bit of the data path, we
can eliminate an IF-element and replace the multiport flip-flop with a simple
flip-flop. The chip we have fabricated includes this modification, as well as
the optimization that consists in making input port in active.
Figure 7.3: The control part of the stack element

Figure 7.5: Channel interface for input port

Figure 7.8: Output action interface

Figure 7.9: The complete data path


Chapter  8 

Asynchronous  Adders 


8.1  Introduction 

The  purpose  of  this  chapter  is  to  describe  the  design  of  an  asynchronous  ripple- 
carry  adder  as  an  illustration  of  the  transformations  and  design  decisions  that 
play  a  role  in  the  construction  of  asynchronous  VLSI  circuits  for  arithmetic 
functions. 


8.2  Function  Evaluation 

The evaluation of a function, say, F(X), usually appears in a program in
the form Y := F(X), i.e., the function is evaluated and its value is assigned
to a variable Y. The first program transformation we perform consists of
separating the function evaluation from the assignment. Let P be the program
containing the assignment Y := F(X). We apply the transformation:

P ▷ (P^{Y:=F(X)}_D ‖ (D/C!F(X)) ‖ *[C?Y])


or, alternatively:

P ▷ (P^{Y:=F(X)}_D ‖ (D/C?Y) ‖ *[C!F(X)])

(C and D are channels introduced for process decomposition. We use the
same global name for the two ports of the same channel.)

The two alternative decompositions are equivalent. We leave it to the
reader to check that whatever decomposition and handshaking expansion are
used, they will contain the process *[C!F(X)] with C passive, or a
handshaking expansion equivalent to that process up to the renaming of variable
ci.
First, we briefly discuss the general approach to the implementation of
this process. The handshaking expansion has to be a generalization of the
handshaking expansion of the bare passive communication action C:

*[[ci]; co↑; [¬ci]; co↓]

with ci and co boolean, and the environment implementing either the
lazy-active protocol:

*[[¬co]; ci↑; [co]; ci↓]

or the usual active protocol. Initially, ci and co are false.
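
The interplay of the passive expansion with a lazy-active environment can be
illustrated by a small Python sketch that runs both loops as state machines
over the shared wires ci and co. The function name four_phase and the program
counters env_pc and proc_pc are assumptions of this sketch; at every step
exactly one of the four transitions is enabled, so the recorded order is
unique.

```python
def four_phase(cycles):
    """Run `cycles` handshakes between a lazy-active environment and a
    passive process; return the wire transitions in order of occurrence."""
    ci = co = False        # initially, ci and co are false
    env_pc = proc_pc = 0   # position inside each loop body
    trace = []
    while len(trace) < 4 * cycles:
        # Environment: *[[¬co]; ci↑; [co]; ci↓]
        if env_pc == 0 and not co:
            ci, env_pc = True, 1
            trace.append("ci+")
        elif env_pc == 1 and co:
            ci, env_pc = False, 0
            trace.append("ci-")
        # Passive process: *[[ci]; co↑; [¬ci]; co↓]
        elif proc_pc == 0 and ci:
            co, proc_pc = True, 1
            trace.append("co+")
        elif proc_pc == 1 and not ci:
            co, proc_pc = False, 0
            trace.append("co-")
    return trace
```

Each cycle produces the four transitions ci↑, co↑, ci↓, co↓ in that order,
which is the four-phase handshake described above.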

The generalization requires that the single boolean co be replaced with a
set C of booleans, and the two assignments co↑ and co↓ with two multiple
assignments to the elements of C, denoted C⇑ and C⇓, respectively.

The two predicates ¬co and co as used in the wait-actions of the
environment have to be replaced by two general predicates, z(C) (for "zero") and
v(C) (for "valid"), respectively, such that ¬z(C) ∨ ¬v(C) is invariantly true.
A value of C for which v(C) holds is called a valid value. A value of C for
which z(C) holds is called a zero value.

Furthermore, the two assignments C⇑ and C⇓ fulfill the requirements:

{z(C)} C⇑ {v(C)}

and

{v(C)} C⇓ {z(C)},

so that the protocol between the process and the environment can now be
described by the two handshaking expansions:

*[[ci]; C⇑; [¬ci]; C⇓]

and

*[[z(C)]; ci↑; [v(C)]; ci↓] .

Initially, ci is false, and z(C) holds.

8.2.1  Delay-Insensitive  Codes 

The  code  for  C  must  fulfill  the  following  requirements. 

• All values that can be transmitted on the channel, typically all integers
from 0 to 2^N − 1 for some given N, can be coded as valid values, and
at least one zero value of C can be coded such that ¬z(C) ∨ ¬v(C).

• When C is assigned a valid value by the concurrent assignment C⇑, no
intermediate value taken by C during the assignment is a valid value.
(Otherwise, the wait action [v(C)] of the environment could be
completed too early.)

• Symmetrically, when C is assigned a zero value by the concurrent
assignment C⇓, no intermediate value taken by C during the assignment
is a zero value. (Otherwise, the wait action [z(C)] of the environment
could be completed too early.)

• Finally, a "side-effect" of the assignment C⇑ is to assign to C a (valid)
value whose numerical interpretation, say ν(C), is such that ν(C) = F(X).

8.2.2  Dual-rail  Code 

There are many codes for C that fulfil the above requirements. A simple and
popular one is the so-called dual-rail code. If C represents an N-bit integer,
each bit is coded with two boolean variables. Bit c_k is coded with ct_k and
cf_k: ct_k is the "true bit" of c_k and is set to true when c_k has to be set to
true; cf_k is the "false bit" of c_k and is set to true when c_k has to be set to
false. We have:

z(C) ≡ (∧ k :: ¬ct_k ∧ ¬cf_k)
v(C) ≡ (∧ k :: ct_k ∧ ¬cf_k ∨ cf_k ∧ ¬ct_k)
v(C) ⇒ (∀k :: c_k = ct_k)

(In this paper, all quantifications range from 0 to N − 1, with N > 0.) Since
(∧ k :: ¬ct_k ∨ ¬cf_k) is maintained as an invariant, v(C) is implied by the
simpler condition:

(∧ k :: ct_k ∨ cf_k) .

Observe that there is only one zero value of C, namely, (∧ k :: ¬ct_k ∧ ¬cf_k),
and there are many values of C that are not valid and not zero.
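
The dual-rail predicates can be made concrete with a short Python sketch. It
represents a code word as a list of (true rail, false rail) pairs, one per
bit, and checks the second requirement of Section 8.2.1 by enumerating every
intermediate value that can occur while the rails of a valid word are raised
in an arbitrary order. The function names encode, is_zero, is_valid, decode,
and intermediates are assumptions of this sketch.

```python
from itertools import product


def encode(value, n):
    """Dual-rail encode an n-bit integer as a list of (ct_k, cf_k) pairs."""
    return [(bool(value >> k & 1), not (value >> k & 1)) for k in range(n)]


def is_zero(c):
    """z(C): both rails of every bit are low."""
    return all(not ct and not cf for ct, cf in c)


def is_valid(c):
    """v(C): exactly one rail of every bit is high."""
    return all(ct != cf for ct, cf in c)


def decode(c):
    """Numerical value of a valid code word (bit k is true iff ct_k)."""
    assert is_valid(c)
    return sum(1 << k for k, (ct, _) in enumerate(c) if ct)


def intermediates(target):
    """All values reachable while the high rails of `target` are raised one
    by one, excluding the zero value and `target` itself."""
    high = [(k, r) for k, pair in enumerate(target) for r in (0, 1) if pair[r]]
    for mask in product([False, True], repeat=len(high)):
        if any(mask) and not all(mask):
            c = [[False, False] for _ in target]
            for (k, r), up in zip(high, mask):
                c[k][r] = up
            yield [tuple(p) for p in c]
```

For example, encode(5, 3) yields the pairs (true, false), (false, true),
(true, false), and every intermediate value on the way from the zero value to
that word has at least one bit with both rails low, so it is neither valid nor
zero.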

8.2.3  Stable  versus  Communicated  Inputs 

The implementation of C!F(X) described so far relies on the assumption that
when ci holds, the input X has the valid value for the evaluation of F and
that X is not changed through the evaluation of F. We say that the input is
stable.

An alternative solution consists in having X received as a message
on channel C, and F(X) sent as a message on the same channel: C
implements the swap of X and F(X). In that case, the input X has to go
through the valid/zero cycle and is dual-rail encoded (or encoded with any
other delay-insensitive code). We say that the input is communicated. The
handshaking expansion of C!F(X) is of the form:

*[[v(X)]; C⇑; [z(X)]; C⇓] .   (8.1)




The handshaking expansion of (8.1) requires that all boolean inputs of X
be valid before any elementary assignment of C⇑ is started, and that all
boolean inputs of X be zero before any elementary assignment of C⇓ is
started. Such an ordering requirement is unnecessarily strong. The following
weaker requirement, which we call the weak handshake rule for communicated
input, is sufficient:

For each boolean x of input X, there is at least one elementary assignment
c↑ of C⇑ such that {v(x)} c↑, and there is at least one elementary assignment
c'↓ of C⇓ such that {z(x)} c'↓.

Hence, each boolean input variable x is part of the handshaking sequence:

*[[v(x)]; c↑; [z(x)]; c'↓] .   (8.2)

In the following implementation of the addition, the operands of the addition
are assumed to be stable but the carry inputs are communicated.


8.3  Binary  Addition 

We want to implement the process *[S!(A + B)]. Its handshaking expansion
is:

*[[si]; S⇑ {ν(S) = ν(A) + ν(B)}; [¬si]; S⇓] .   (8.3)

A, B, and S are N-bit integers. Inputs A and B are assumed to be stable,
but the sum S is dual-rail encoded.

Next, we need to refine the postcondition ν(S) = ν(A) + ν(B) in terms
of relations between each bit of S and the corresponding bits of A and B.
There are many ways to describe these relations, each corresponding to a
particular addition algorithm. Here, we choose the algorithm that is usually
called "ripple-carry adder."

8.3.1  Ripple-carry  Addition 

The value of bit s_k of S can be expressed as a function of the bits a_k and b_k
of A and B, and of the carry-in bit c_k. More precisely, the postcondition of
the addition can be expressed as:

¬c_0 ∧ (∀k :: sum_k)

where each sum_k is the conjunction of the three predicates:

(¬a_k ∧ ¬b_k) ⇒ (s_k, c_{k+1} = c_k, false)
(a_k ∧ b_k) ⇒ (s_k, c_{k+1} = c_k, true)
(a_k ≠ b_k) ⇒ (s_k, c_{k+1} = ¬c_k, c_k)
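
The three predicates determine s_k and c_{k+1} for every combination of a_k,
b_k, and c_k, so they can be read directly as a function. The Python sketch
below does so and chains the cells with c_0 false; it is an illustration of
the postcondition only, not of the circuit, and the names sum_k, ripple_add,
to_bits, and from_bits are assumptions of the sketch.

```python
def sum_k(a, b, c):
    """Return (s_k, c_{k+1}) as prescribed by the three predicates."""
    if not a and not b:
        return c, False
    if a and b:
        return c, True
    return (not c), c  # case a_k != b_k


def ripple_add(a_bits, b_bits):
    """Add two little-endian bit lists; return (sum bits, final carry)."""
    c = False  # the postcondition requires c_0 to be false
    s_bits = []
    for a, b in zip(a_bits, b_bits):
        s, c = sum_k(a, b, c)
        s_bits.append(s)
    return s_bits, c


def to_bits(v, n):
    """Little-endian list of the n low-order bits of v."""
    return [bool(v >> k & 1) for k in range(n)]


def from_bits(bits):
    """Integer value of a little-endian bit list."""
    return sum(1 << k for k, bit in enumerate(bits) if bit)
```

Enumerating all pairs of 4-bit operands confirms that the sum bits together
with the final carry represent A + B.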
The computation of bit s_k of the sum requires the previous computation
of carry bit c_k and therefore also produces carry bit c_{k+1}. Hence, the carry
bits also have to be dual-rail encoded and used as communicated variables, i.e.,
we will have to add the waits [v(c_k)] and [z(c_k)] in the handshaking expansion.

We can easily design the program add_k that establishes sum_k. Its inputs
are a_k and b_k, and the carry-in bits ct_k and cf_k. Its outputs are st_k and sf_k
and the carry-out bits ct_{k+1} and cf_{k+1}. We get:

add_k ≡ [ ¬a_k ∧ ¬b_k → ([ct_k → st_k↑ [] cf_k → sf_k↑]) ‖ cf_{k+1}↑
        [] a_k ∧ b_k → ([ct_k → st_k↑ [] cf_k → sf_k↑]) ‖ ct_{k+1}↑
        [] a_k ≠ b_k → [ct_k → sf_k↑, ct_{k+1}↑ [] cf_k → st_k↑, cf_{k+1}↑]
        ]

(The comma is used as an alternative to ‖ for the parallel composition of
simple assignments.)
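
As a hedged illustration, the up-going behavior of add_k can be written as a
Python function on dual-rail values. A rail is reported as low (False) as long
as the guard that raises it is not yet true, so calling the function with both
carry-in rails low shows which outputs can be produced before the carry-in
arrives. The function name add_k and the argument order are assumptions of
this sketch, and the operands are expected to be Python booleans.

```python
def add_k(a, b, ct, cf):
    """Up-going phase of add_k.

    Inputs: stable operand bits a, b and the dual-rail carry-in (ct, cf).
    Returns (st, sf, ct_next, cf_next).
    """
    if a == b:
        # The carry-out is determined by the operands alone.
        ct_next, cf_next = a, not a
        # The sum bit copies the carry-in once it is valid.
        st, sf = ct, cf
    else:
        # The carry-in is propagated; the sum bit is its complement.
        ct_next, cf_next = ct, cf
        st, sf = cf, ct
    return st, sf, ct_next, cf_next
```

For a_k = b_k = true and a carry-in that is still zero, the result already has
ct_{k+1} high while both sum rails are low; for a_k ≠ b_k all four outputs
remain low until one carry-in rail rises.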

8.3.2  Handshaking  Expansion 

We now replace S⇑ with add_k in (8.3) and implement S⇓. This refinement
gives:

*[[si]; (‖ k :: add_k); [¬si]; (‖ k :: st_k↓, sf_k↓, ct_{k+1}↓, cf_{k+1}↓)] .   (8.4)

Furthermore, the first carry-in bits are generated by the program:

*[[si → ct_0↓, cf_0↑
 [] ¬si → ct_0↓, cf_0↓
 ]] .

Next we have to enforce the weak handshake rule of (8.2) for the inputs c_k.

A straightforward solution is:

*[[si]; (‖ k :: [v(c_k)]; add_k); [¬si ∧ (∧ k :: z(c_k))];
  (‖ k :: st_k↓, sf_k↓, ct_{k+1}↓, cf_{k+1}↓)] .   (8.5)

Unfortunately, this solution is entirely sequential since the expressions
v(c_k) become true in the order of increasing k. (Observe that including the
waits for v(c_k) in the wait for si, as [si ∧ (∧ k :: v(c_k))], would result in a
deadlock since only v(c_0) holds initially.) But we can use the fact that the
computation of each bit s_k in add_k requires that v(c_k) hold. We can therefore
remove the explicit wait [v(c_k)] from (8.5) and still fulfill the weak handshake
rule. Now, the upgoing part of the computation of a carry bit can proceed
without waiting for the previous carry bit when a_k = b_k. Concurrency in the
computation of the carry bits also introduces concurrency in the computation
of the sum bits. However, the downgoing part of the computation of the
carry bits is still sequential since the wait for (∧ k :: z(c_k)) still precedes all
downgoing assignments. We improve this part as follows.

We first apply a transformation rule that takes the parallel quantification
(‖ k :: out of the process. This transformation results in replacing the single
process with N parallel processes:

(‖ k :: *[[si]; add_k; [¬si ∧ z(c_k)]; st_k↓, sf_k↓, ct_{k+1}↓, cf_{k+1}↓]) .   (8.6)

Second, we replace the downgoing sequence of (8.6)

[¬si ∧ z(c_k)]; st_k↓, sf_k↓, ct_{k+1}↓, cf_{k+1}↓

with the sequence

([¬si → ct_{k+1}↓, cf_{k+1}↓] ‖ [¬ct_k ∧ ¬cf_k → st_k↓, sf_k↓]) ,

in which we have implemented z(c_k) as ¬ct_k ∧ ¬cf_k. The weak handshake
rule is still obeyed. We get the final handshaking expansion:

(‖ k :: *[[si]; add_k; ([¬si → ct_{k+1}↓, cf_{k+1}↓] ‖ [¬ct_k ∧ ¬cf_k → st_k↓, sf_k↓])]) .   (8.7)

Now, the downgoing part of the carry-out generation can proceed without
waiting for the carry-in; and, as we mentioned before, the upgoing part of
the carry-out generation can proceed without waiting for the carry-in when
a_k = b_k. This optimization of the carry-chain length is the main characteristic
of this type of adder.
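
Since the up-going carry-out of a cell waits for its carry-in only when
a_k ≠ b_k, the longest run of consecutive positions with differing operand
bits gives a rough, data-dependent measure of how far a carry has to ripple.
The Python sketch below computes that run length. It is only a proxy under
the assumption that all cells have equal delay, and the function name
longest_carry_chain is hypothetical.

```python
def longest_carry_chain(a, b, n):
    """Length of the longest run of positions k < n with a_k != b_k."""
    longest = run = 0
    for k in range(n):
        if (a >> k & 1) != (b >> k & 1):
            run += 1                  # the carry must propagate through cell k
            longest = max(longest, run)
        else:
            run = 0                   # cell k produces its carry-out locally
    return longest
```

For instance, adding 0b1111 and 0b0000 gives a run of length 4 over four bits,
whereas adding 0b1010 to itself gives a run of length 0 because every cell
determines its carry-out from its operands alone.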


8.4  Implementation  of  the  Adder  Cells 

Each program of (8.7) is called an adder-cell. For the rest of the
implementation of the adder-cells, we can omit the subscripts k and k + 1. The input
variables are a, b, and ct and cf for the carry-in bits. The output variables
are st and sf for the sum bits, and dt and df for the carry-out bits.

We first simplify the program of add by combining the guards and factoring
the parallel composition. We get:

add ≡ ([ (¬a ∧ ¬b) ∨ ((a ≠ b) ∧ cf) → df↑
       [] (a ∧ b) ∨ ((a ≠ b) ∧ ct) → dt↑
       ]
     ‖ [ ct ∧ (a = b) ∨ cf ∧ (a ≠ b) → st↑
       [] cf ∧ (a = b) ∨ ct ∧ (a ≠ b) → sf↑
       ]
     )
The complete program for an adder-cell is:

*[[si]; add; ([¬si → dt↓, df↓] ‖ [¬ct ∧ ¬cf → st↓, sf↓])]   (8.8)

Next, we include the wait [si] into the guards of add, i.e., a guard G becomes
G ∧ si.

The program of an adder-cell becomes:

adder-cell ≡ *[([ (si ∧ ¬a ∧ ¬b) ∨ ((a ≠ b) ∧ cf) → df↑
                [] (si ∧ a ∧ b) ∨ ((a ≠ b) ∧ ct) → dt↑
                ]
              ‖ [ ct ∧ (a = b) ∨ cf ∧ (a ≠ b) → st↑
                [] cf ∧ (a = b) ∨ ct ∧ (a ≠ b) → sf↑
                ]
              );
              ([¬si → dt↓, df↓] ‖ [¬ct ∧ ¬cf → st↓, sf↓])
             ]


We have optimized this transformation by adding si only in the terms of the
guards of add that do not contain ct or cf since we can prove that

(ct_k ⇒ si) ∧ (cf_k ⇒ si)   (8.9)

holds for all k. The proof of (8.9) is by induction on k: For the base case
k = 0, (8.9) holds obviously because of the program:

*[[si → ct_0↓, cf_0↑
 [] ¬si → ct_0↓, cf_0↓
 ]] .

For the induction case, we prove that if (8.9) holds for k it holds for k + 1.
The structure of the guarded commands is such that (dt ⇒ (si ∨ ct)) ∧ (df ⇒
(si ∨ cf)) holds as a postcondition of adder-cell. But since, by the induction
hypothesis, (ct ⇒ si) ∧ (cf ⇒ si) holds for k, we have established (dt ⇒
si) ∧ (df ⇒ si). Since dt for cell k is ct for cell k + 1, and similarly for df and
cf, (8.9) is established for k + 1.

8.4.1  Production-rule  Expansion 

We add a minor modification: The guards of df↑ and dt↑ are equivalent to
(and can be replaced with):

(si ∧ ¬a ∧ ¬b) ∨ (¬a ∨ ¬b) ∧ cf

and

(si ∧ a ∧ b) ∨ (a ∨ b) ∧ ct

respectively. The production-rule expansion is now straightforward:

(si ∧ ¬a ∧ ¬b) ∨ cf ∧ (¬a ∨ ¬b) ↦ df↑
(si ∧ a ∧ b) ∨ ct ∧ (a ∨ b)     ↦ dt↑
¬si                             ↦ dt↓, df↓
(ct ∧ a = b) ∨ (cf ∧ a ≠ b)     ↦ st↑
(cf ∧ a = b) ∨ (ct ∧ a ≠ b)     ↦ sf↑
¬ct ∧ ¬cf                       ↦ st↓, sf↓
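
The equivalence of the original and modified guards of df↑ and dt↑ relies on
(8.9): the two forms differ only in states where cf or ct holds while si does
not. The Python sketch below checks this by enumerating all assignments of the
five booleans and skipping those excluded by (8.9). The function names
guards_original, guards_modified, and equivalent_under_invariant are
assumptions of this sketch.

```python
from itertools import product


def guards_original(si, a, b, ct, cf):
    """Guards of df↑ and dt↑ as they appear in the adder-cell program."""
    df = (si and not a and not b) or ((a != b) and cf)
    dt = (si and a and b) or ((a != b) and ct)
    return df, dt


def guards_modified(si, a, b, ct, cf):
    """Guards of df↑ and dt↑ after the minor modification."""
    df = (si and not a and not b) or ((not a or not b) and cf)
    dt = (si and a and b) or ((a or b) and ct)
    return df, dt


def equivalent_under_invariant():
    """Compare both guard pairs on every state that satisfies (8.9)."""
    for si, a, b, ct, cf in product([False, True], repeat=5):
        if (ct and not si) or (cf and not si):
            continue  # excluded by (ct ⇒ si) ∧ (cf ⇒ si)
        if guards_original(si, a, b, ct, cf) != guards_modified(si, a, b, ct, cf):
            return False
    return True
```

The check succeeds on all states satisfying (8.9). Without the invariant the
guards are not interchangeable: with si, a, b, and ct false and cf true, the
modified guard of df↑ holds while the original one does not.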


8.5  Implementation  Issues 

The CMOS gates for dt and st are shown in Figure 8.1. We use dynamic logic
for these state-holding gates since there are no data-dependent delays between
the upgoing and downgoing transitions. As usual, the logic is inverting, and
thus the gates produce the complementary signals dt_ and st_ of dt and st.
Adding an inverter at the output of each gate that produces dt_ is an expensive
solution since the carry chain may include up to N inverters in series in
addition to the N carry gates. A better solution consists in alternating gates
that produce dt_ and df_ (the even-numbered bits) with gates that produce
dt and df (the odd-numbered bits).

Finally, we can simplify the design of the adder-cell 0: We can eliminate
inputs ct_0 and cf_0 since ct_0 is identically false and cf_0 = si. (With ct_0
false, the term (a ∨ b) ∧ ct of the guard of dt↑ disappears.) The simplified
production rules are:

si ∧ (¬a ∨ ¬b) ↦ df↑
si ∧ a ∧ b     ↦ dt↑
¬si            ↦ dt↓, df↓
cf ∧ (a ≠ b)   ↦ st↑
cf ∧ (a = b)   ↦ sf↑
¬cf            ↦ st↓, sf↓
Figure 8.1: CMOS implementation of the true bits of the sum and carry


Chapter 9

The  First  Asynchronous 
Microprocessor 


9.1  Introduction 

In  this  chapter,  we  describe  a  delay-insensitive  microprocessor  my  students 
and  I  designed  at  Caltech  in  the  fall  of  1988.  It  is  the  first  delay-insensitive 
or  even  asynchronous  microprocessor  ever  designed.  It  is  a  16-bit,  RISC- 
like  architecture.  The  version  implemented  in  1.6  micron  SCMOS  runs  at  18 
MIPS.  The  chips  were  found  functional  on  “first  silicon.” 

As  we  explained  in  Section  2.8.6,  the  processor  was  first  specified  as  a 
sequential  program,  which  was  then  transformed  into  a  concurrent  program 
so  as  to  pipeline  instruction  execution.  The  circuits  were  derived  from  the 
concurrent  program  by  semantics-preserving  program  transformation. 

The  design  was  undertaken  as  a  large-scale  application  of  the  high-level 
synthesis  method  for  asynchronous  VLSI  that  we  have  developed  in  these 
notes. 

The  results  of  the  experiment  can  be  summarized  as  follows.  First,  it  is 
possible  and  advantageous  to  describe  circuits,  even  of  the  size  and  complexity 
of  a  microprocessor,  in  a  high-level  program  notation.  With  the  exception  of 
the ALU function, the complete program takes less than two pages; let
us say that a complete description including all functions would take
approximately three pages. The transformations performed on the initial sequential
program  to  introduce  pipelining  show  that  the  notation  is  appropriate  for  a 
designer  to  work  with  efficiently,  since  all  important  design  decisions  can  be 
made  at  the  level  of  source  code. 

Second, it is possible to derive the circuit from the program by applying
systematic semantics-preserving transformations, and to obtain a circuit that
is correct on first silicon. The compilation procedure is not described here,
but can be found in several papers, in particular [6].

Third,  the  results  of  the  experiment  demonstrate  that  the  often  accepted 
“fatalities,”  that  formal  design  methods  and  asynchronous  techniques  lead 
to  inefficient  solutions,  are  simply  myths  fueled  by  the  natural  resistance  to 
change.  Not  only  is  the  processor  surprisingly  small  and  fast  for  a  first  design, 
but  it  also  exhibits  a  robustness  to  parameter  variations  that  goes  beyond  our 
expectations  and  almost  beyond  our  understanding:  One  of  the  two  versions 
seems  still  to  function  with  a  voltage  value  of  0.35V  for  the  VDD!  Maybe  the 
biggest  surprise  is  the  very  low  power  consumption  of  the  chips,  which  makes 
this  design  style  ideally  suited  for  use  in  highly  concurrent  architectures  where 
a  large  number  of  chips  are  tightly  packed. 


9.2  The  Processor:  The  Test  Results 

The  processor  has  a  16-bit,  RISC-like  instruction  set.  It  has  sixteen  registers, 
four  buses,  an  ALU,  and  two  adders.  Instruction  and  data  memories  are 
separate.  The  chip  size  is  about  20,000  transistors.  Two  versions  have  been 
fabricated: one in 2 µm MOSIS SCMOS, and one in 1.6 µm MOSIS SCMOS.
(The dimension refers to the minimal width of a wire.) On the 2 µm version,
only twelve registers were implemented in order to fit the chip on the 84-pin
6600 µm × 4600 µm pad frame.

With the exception of isochronic forks, the chips are entirely delay-insensitive.
The circuits use neither clocks nor knowledge about delays. The only
exception to the design method is the interface with the memories. In the absence of
available memories with asynchronous interfaces, we have simulated the
completion signal from the memories with an external delay. For testing purposes,
the delay on the instruction memory interface is variable.

In spite of the presence of several floating n-wells, the 2 µm version runs
at 12 MIPS. The 1.6 µm version runs at 18 MIPS. (Those performance
figures are based on measurements from sequences of ALU instructions without
carry. They do not take advantage of the overlap between ALU and
memory instructions.) Those performances are quite encouraging given that the
design is very conservative: It uses static gates, dual-rail encoding of data,
completion trees, etc.

Only 2 of the 12 chips in 2 µm passed all tests, but 34 of the 50 chips in
1.6 µm were found to be functional. (However, within a certain range of values
for the instruction memory delay, the 1.6 µm version malfunctions. We will
return to this phenomenon, which is related to the implementation of
isochronic forks.)

It takes fewer than 700 instructions to test the processors for stuck-at faults.
The program counter is the only part that was not tested exhaustively because
the memory used for the test did not contain the address required for testing
the most significant bit of the program counter.


Figure  9.1:  MIPS  as  a  function  of  VDD 

We have tested the chips under a wide range of VDD voltage values. At
room temperature, the 2 µm version is functional in a voltage range from 7V
down to 0.35V! And it reaches 15 MIPS at 7V. We have also tested the chips
cooled in liquid nitrogen. The 2 µm version reaches 20 MIPS at 5V and 30
MIPS at 12V. The 1.6 µm version reaches 30 MIPS at 5V. Of course, the
measurements are made without adjusting any clocks (there are none), but
simply by connecting the processor to a memory containing a test program
and observing the rate of instruction execution. The results are summarized
in Figure 9.1. The power consumption is 145mW at 5V and 6.7mW at 2V.


9.3  Specification  of  the  processor 

The  instruction  set  is  deliberately  not  innovative.  It  is  a  conventional  16-bit- 
word  instruction  set  of  the  load-store  type.  The  processor  uses  two  separate 
memories  for  instructions  and  data.  There  are  three  types  of  instructions: 
ALU,  memory,  and  program-counter  (pc).  All  ALU  instructions  operate  on 
registers;  memory  instructions  involve  a  register  and  a  data  memory  word. 
Certain  instructions  use  the  following  word  as  offset.  The  only  important 
omissions,  those  of  an  interrupt  mechanism  and  communication  ports,  are 
ones  we  found  to  be  unnecessary  distractions  in  a  first  design. 

Figure 9.2: Process and channel structure

9.4  Decomposition  into  Concurrent  Processes 

The program of Section 2.8.6 is further decomposed into a set of concurrent
processes. In this program we have used a restricted form of shared variables.
The control channels Xs, Ys, ZAs, ZWs, ZRs, and the bus ZA are
one-to-many; the buses X, Y, ZM are many-to-many; the other channels are
one-to-one. But all channels are used by only two processes at a time. The
structure of processes and channels is shown in Figure 9.2. The final program
is shown in Figures 9.3 and 9.4. Process FETCH fetches the instructions
from the instruction memory, and transmits them to process EXEC, which
decodes them. Process PCADD updates the address pc of the next instruction
concurrently with the instruction fetch, and controls the offset register.
The execution of an ALU instruction by process ALU can overlap with the
execution of a memory instruction by process MU. The jump and branch
instructions are executed by EXEC; store-pc is executed by the ALU as the
instruction “add the content of register r to the pc and store it.” The array
REG[k] of processes implements the register file. Both MU and PCADD
contain their own adder. Processes IMEM and DMEM describe the instruction
memory and data memory, respectively.

9.4.1  Updating  the  PC 

The variable pc is updated by process PCADD, and is used by IMEM as the
index of the array imem during the ID communication (the instruction fetch).

The assignment pc := pc + 1 is decomposed into y := pc + 1; pc := y, where
y is a local variable of PCADD. The overlap of the instruction fetch, ID
(either ID?i or ID?offset), and the pc increment, y := pc + 1, can now occur
while pc is constant. Action ID?i is enclosed between the two communication
actions PCI1 and PCI2, as follows:

PCI1; ID?i; PCI2 .

In PCADD, y := pc + 1 is enclosed between the same two communication
actions while the updating of pc follows PCI2:

\overline{PCI1} → PCI1; y := pc + 1; PCI2; pc := y .

Since the completions of PCI1 and PCI2 in FETCH coincide with the completions
of PCI1 and PCI2 in PCADD, respectively, the execution of ID?i in
FETCH overlaps the execution of y := pc + 1 in PCADD. PCI1 and PCI2
are implemented as the two halves of the same communication handshaking
to minimize the overhead.
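
The overlap described above can be imitated in software. The following is a minimal sketch, not the circuit implementation: it assumes that each of PCI1 and PCI2 can be modelled as a two-party rendezvous (a Python threading.Barrier), and the names run_fetch_pcadd, state, and fetched are hypothetical. Between the two rendezvous points, the fetch reads imem[pc] while the increment computes y, and pc itself is written only after PCI2.

```python
import threading

def run_fetch_pcadd(imem, steps):
    """Run toy FETCH and PCADD loops for `steps` instruction fetches."""
    pci1 = threading.Barrier(2)   # rendezvous standing in for PCI1
    pci2 = threading.Barrier(2)   # rendezvous standing in for PCI2
    state = {"pc": 0}
    fetched = []

    def fetch():
        for _ in range(steps):
            pci1.wait()
            fetched.append(imem[state["pc"]])  # ID?i: pc is constant here
            pci2.wait()

    def pcadd():
        for _ in range(steps):
            pci1.wait()
            y = state["pc"] + 1                # overlaps with the fetch
            pci2.wait()
            state["pc"] = y                    # pc changes only after PCI2

    threads = [threading.Thread(target=fetch), threading.Thread(target=pcadd)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return fetched, state["pc"]

program = ["i0", "i1", "i2", "i3"]
fetched, final_pc = run_fetch_pcadd(program, len(program))
```

Because the next PCI1 cannot complete until PCADD has returned to the top of its loop, the assignment pc := y is always finished before the following fetch reads pc.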

In  order  to  concentrate  all  increments  of  pc  inside  PCADD,  we  use  the 
same  technique  to  delegate  the  assignment  pc  :=  pc  +  offset  (executed  by  the 
EXEC  part  in  the  sequential  program)  to  PCADD. 

The guarded command \overline{Xof} → X!offset • Xof in PCADD has been
transformed into a concurrent process since it need only be mutually exclusive
with assignment y := pc + offset, and this mutual exclusion is enforced by the
sequencing between PCA1; PCA2 and Xof within EXEC.


9.5  Stalling  the  Pipeline 

When the pc is modified by EXEC as part of the execution of a pc instruction
(store-pc, jump, or branch), fetching the next instruction by FETCH is
postponed until the correct value of the pc is assigned to PCADD.pc.

When  the  offset  is  reserved  for  MU  by  EXEC,  as  part  of  the  execution  of 
some  memory  instructions,  fetching  the  next  instruction,  which  might  be  a 
new  offset,  is  postponed  until  MU  has  received  the  value  of  the  current  offset. 

IMEM  ≡ *[ID!imem[pc]]

FETCH ≡ *[PCI1; ID?i; PCI2;
          [offset(i.op) → PCI1; ID?offset; PCI2
          [] ¬offset(i.op) → skip
          ]; E1!i; E2
         ]

PCADD ≡ (*[[\overline{PCI1} → PCI1; y := pc + 1; PCI2; pc := y
          [] \overline{PCA1} → PCA1; y := pc + offset; PCA2; pc := y
          [] \overline{Xpc} → X!pc • Xpc
          [] \overline{Ypc} → Y?pc • Ypc
          ]]
        || *[[\overline{Xof} → X!offset • Xof]]
        )

EXEC  ≡ *[E1?j;
          [alu(j.op)  → E2; Xs • Ys • AC!j.op • ZAs
          [] ld(j.op)   → E2; Xs • Ys • MC1 • ZRs
          [] st(j.op)   → E2; Xs • Ys • MC2 • ZWs
          [] ldx(j.op)  → Xof • Ys • MC1 • ZRs; E2
          [] stx(j.op)  → Xof • Ys • MC2 • ZWs; E2
          [] lda(j.op)  → Xof • Ys • MC3 • ZRs; E2
          [] stpc(j.op) → Xpc • Ys • AC!add • ZAs; E2
          [] jmp(j.op)  → Ypc • Ys; E2
          [] brch(j.op) → F?f; [cond(f, j.cc) → PCA1; PCA2
                               [] ¬cond(f, j.cc) → skip
                               ]; E2
          ]]


Figure 9.3: The final program, first part

ALU    ≡ *[[\overline{AC} → AC?op • X?x • Y?y;
                   (z, f) := aluf(x, y, op, f); ZA!z
           [] \overline{F} → F!f
          ]]

MU     ≡ *[[\overline{MC1} → X?x • Y?y • MC1; ma := x + y; MDl?w; ZM!w
           [] \overline{MC2} → X?x • Y?y • MC2 • ZM?w; ma := x + y; MDs!w
           [] \overline{MC3} → X?x • Y?y • MC3; ma := x + y; ZM!ma
          ]]

DMEM   ≡ *[[\overline{MDl} → MDl!dmem[ma]
           [] \overline{MDs} → MDs?dmem[ma]
          ]]

REG[k] ≡ (*[[¬bk ∧ k = j.x ∧ \overline{Xs} → X!r • Xs]]
         || *[[¬bk ∧ k = j.y ∧ \overline{Ys} → Y!r • Ys]]
         || *[[¬bk ∧ k = j.z ∧ \overline{ZWs} → ZM!r • ZWs]]
         || *[[¬bk ∧ k = j.z ∧ \overline{ZAs} → bk↑; ZAs; ZA?r; bk↓]]
         || *[[¬bk ∧ k = j.z ∧ \overline{ZRs} → bk↑; ZRs; ZM?r; bk↓]]
         )


Figure 9.4: The final program, second part


In  the  second  design,  we  have  refined  the  protocol  to  block  FETCH  only  when 
the  next  instruction  is  a  new  offset. 

Postponing the start of the next cycle in FETCH is achieved by postponing
the completion of the previous cycle, i.e., by postponing the completion of
the communication action on channel E. As in the case of the PCI
communication, E is decomposed into two communications, E1 and E2. Again, E1
and E2 are implemented as the two halves of the same handshaking protocol.

In FETCH, E!i is replaced with E1!i; E2. In EXEC, E2 is postponed
until after either X!offset or a complete execution of a pc instruction has
occurred.


9.6  Sharing  Registers  and  Buses 

A bus is used by two processes at a time, one of which is a register and the
other is EXEC, MU, ALU, or PCADD. We therefore decided to introduce
enough buses so as not to restrict the concurrent access to different registers.
For instance, ALU writing a result into a register should not prevent MU from
using another register at the same time.

The  four  buses  correspond  to  the  four  main  concurrent  activities  involving 
the registers. The X bus and the Y bus are used to send the parameters of an
ALU  operation  to  the  ALU,  and  to  send  the  parameters  of  address  calculation 
to  the  memory  unit.  We  also  make  opportunistic  use  of  them  to  transmit  the 
pc  and  the  offset  to  and  from  PCADD. 

The  ZA  bus  is  used  to  transmit  the  result  of  an  ALU  operation  to  the 
registers.  The  ZM  bus  is  used  by  the  memory  unit  to  transmit  data  between 
the  data  memory  and  the  registers. 

We  make  a  virtue  out  of  necessity  by  turning  the  restriction  that  registers 
can  be  accessed  only  through  those  four  buses  into  a  convenient  abstraction 
mechanism.  The  ALU  uses  only  the  X,  Y,  and  ZA  ports  without  having  to 
reference  the  particular  registers  that  are  used  in  the  communications.  It  is 
the task of EXEC to reserve the X, Y, and ZA buses for the proper registers
before the ALU uses them.

The same holds for the MU process, which references only X, Y, and ZM.
An  additional  abstraction  is  that  the  X  bus  is  used  to  send  the  offset  to  MU, 
so  that  the  cases  for  which  the  first  parameter  is  i.x  or  offset  are  now  identical, 
since  both  parameters  are  sent  via  the  X  bus. 

9.6.1  Exclusive  Use  of  a  Bus 

Commands Xpc, Ypc, and Xof are used by EXEC to select the X and Y buses
for communication of pc and offset. Commands Xs, Ys, and ZAs are used by
EXEC to select the X, Y, and ZA buses, respectively, for a register that has
to communicate with the ALU as part of the execution of an ALU instruction.

Two commands are needed to select the ZM bus: ZWs if the bus is to
be used for writing to the data memory, and ZRs if the bus is to be used for
reading from the data memory.

Let  us  first  solve  the  problem  of  the  mutual  exclusion  among  the  different 
uses  of  a  bus.  As  long  as  we  have  only  one  ALU  and  one  memory  unit,  no 
conflict  is  possible  on  the  ZA  and  ZM  buses,  since  only  the  ALU  uses  the 
ZA  bus,  and  only  the  memory  unit  uses  the  ZM  bus.  But  the  X  and  Y  buses 
are  used  concurrently  by  the  ALU,  the  memory  unit,  and  the  pc  unit. 

We achieve mutual exclusion on different uses of the X bus as follows.
(The same argument holds for Y.) The completion of an X communication
is made to coincide with the completion of one of the selection actions Xs,
Xof, Xpc; and the occurrences of these selection actions exclude each other
in time inside EXEC since they appear in different guarded commands.

This coincidence is implemented by the bullet command: We recall that,
for arbitrary communication commands U and V inside the same process,
U • V guarantees that the two actions are completed at the same time. We
then say that the two actions coincide. The use of the bullets X!pc • Xpc and
X!offset • Xof inside PCADD, and X!r • Xs inside the registers enforces the
coincidence of X with Xpc, Xof, and Xs, respectively. The bullets in EXEC,
ALU, and MU have been introduced for reasons of efficiency: Sequencing is
avoided.


9.7  Register  Selection 

Command Xs in EXEC selects the X bus for the particular register whose
index k is equal to the field i.x of the instruction i being decoded by EXEC,
and analogously for commands Ys, ZAs, ZRs, and ZWs.

Each register process REG[k], for 0 ≤ k < 16, consists of five elementary
processes, one for each selection command. The register that is selected by
command Xs is the one that passes the test k = i.x. This implementation
requires that the variable i.x be shared by all registers and EXEC. An
alternative solution that does not require shared variables uses demultiplexer
processes. (The implementations of the two solutions are almost identical.)

The  semicolons  in  the  last  two  guarded  commands  of  REG[k]  are  intro¬ 
duced  to  pipeline  the  computation  of  the  result  of  an  ALU  instruction  or 
memory  instruction  with  the  decoding  of  the  next  instruction. 


9.7.1  Mutual  Exclusion  on  Registers 

A register may be used in several arguments (x, y, or z) of the same instruction,
and also as an argument in two successive instructions whose executions
may overlap. We therefore have to address the issue of the concurrent uses of
the same register. Two concurrent actions on the same register are allowed
when they are both read actions.

Concurrency  within  an  instruction  is  not  a  problem:  X  and  Y  communi¬ 
cations  on  the  same  register  may  overlap,  since  they  are  both  read  actions, 
and  Z  cannot  overlap  with  either  X  or  Y  because  of  the  sequencing  inside 
ALU  and  MU. 

Concurrency  in  the  access  to  a  register  during  two  consecutive  overlapping 
instructions  (one  instruction  is  an  ALU  and  the  other  is  a  memory  instruction) 
can  be  a  problem:  Writing  a  result  into  a  register  (a  ZA  or  a  ZR  action)  in 
the  first  instruction  can  overlap  with  another  action  on  the  same  register  in 
the  second  instruction.  But,  because  the  selection  of  the  z  register  for  the 
first  instruction  takes  place  before  the  selection  of  the  registers  for  the  second 
instruction,  we  can  use  this  ordering  to  impose  the  same  ordering  on  the 
different  accesses  to  the  same  register  when  a  ZA  or  ZR  is  involved. 

This ordering is implemented as follows: In REG[k], variable bk (initially
false) is set to true before the register is selected for ZA or ZR, and it is set
back to false only after the register has been actually used. All uses of the
register are guarded with the condition ¬bk. Hence, all subsequent selections
of the register are postponed until the current ZA or ZR is completed.

We must ensure that bk is not set to true before the register is selected
for an X or a Y action inside the same instruction, since this would lead to
deadlock. We omit this refinement, which does not appear in the program of
Figures 9.3 and 9.4.
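
The role of bk can be illustrated with a small sequential model. This is only a sketch under simplifying assumptions: the class Register and its method names are hypothetical, selections are modelled as method calls rather than handshakes, and a postponed selection is represented by a None or False return value instead of a suspended guard.

```python
class Register:
    """Toy model of REG[k]: every selection is guarded by 'not bk'."""

    def __init__(self, value=0):
        self.value = value
        self.bk = False  # initially false, as in the text

    def try_select_read(self):
        """X or Y selection: return the value, or None if postponed."""
        if self.bk:
            return None
        return self.value

    def try_select_write(self):
        """ZA or ZR selection: set bk before the register is used."""
        if self.bk:
            return False
        self.bk = True
        return True

    def complete_write(self, value):
        """The result arrives on the bus; only then is bk reset."""
        assert self.bk, "no write pending"
        self.value = value
        self.bk = False


r = Register(5)
selected = r.try_select_write()      # first instruction reserves r as z
early_read = r.try_select_read()     # second instruction must wait
r.complete_write(9)                  # result of the first instruction
late_read = r.try_select_read()      # now the new value is visible
```

In this model a read attempted between the selection and the completion of a write is postponed, so the second instruction can never observe the stale value.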


9.8  Conclusion 

Instruction  pipelining  has  been  approached  as  a  concurrent  programming 
problem:  Starting  with  a  sequential  program  for  the  processor,  concurrency 
is  introduced  through  a  series  of  program  transformations.  However,  al¬ 
though  the  transformations  are  guided  by  the  intent  to  overlap  the  important 
phases — fetch,  decode,  execute — of  instruction  execution,  they  are  neither  me¬ 
chanical  nor  unique.  The  designer  decides  how  to  decompose  a  program  into 
several  concurrent  ones.  We  do  not  claim  that  our  solution  in  this  first  design 
is  in  any  way  optimal. 

Since  the  choice  of  an  instruction  set  was  not  part  of  the  experiment,  our 
design  should  be  judged  in  two  ways:  the  choice  of  the  concurrent  program 
of  Figures  9.3  and  9.4,  and  its  implementation.  The  implementation,  which 
is  described  in  [7],  is  satisfactory,  but  not  optimal.  The  sizing  of  transistors 
can  be  improved  and  the  number  of  transitions  can  be  decreased,  mainly  by 
a  better  placement  of  inverters.  For  instance,  the  delays  due  to  the  control 
for  a  buffer  are  both  about  twice  their  theoretical  minimum. 

The program represents the choice of a pipeline, and of synchronization
techniques to implement it. We have deliberately chosen a simple pipeline.
In  particular,  the  mechanism  for  stalling,  which  places  part  of  the  decoding 
in  series  with  the  fetch  on  the  critical  path,  sacrifices  efficiency  for  simplicity. 
However,  performance  evaluations  show  that  the  pipeline  is  well-balanced 
since  the  different  stages  have  comparable  average  delays.  Improving  the 
critical  path  by  overlapping  fetch  and  decode  requires  improving  the  ALU 
and  memory  instruction  execution  stages  by  pipelining  parts  of  these  stages. 

The  practicality  of  overlapping  ALU  and  memory  instruction  executions 
remains  an  open  issue.  It  is  not  clear  whether  the  gain  in  performance  is 
worth  the  complexity  of  the  synchronization  involved  and  the  requirement  of 
two  separate  Z  buses. 

We  find  the  synchronization  techniques  used  to  implement  the  concurrent 
activities  between  the  different  stages  of  the  pipeline  particularly  elegant  and 
efficient,  since  the  delays  incurred  in  a  synchronization  can  be  of  arbitrary 
length  and  vary  from  instruction  to  instruction. 

We foresee excellent performance for asynchronous processors as the feature
size keeps decreasing. But the designer must be ready to use new methods
based on concurrent programming in order to exploit asynchronous techniques
to their fullest.


Chapter  10 


The  Limitations  to 
Delay-Insensitivity 

10.1  Introduction 

In  this  chapter,  we  characterize  the  class  of  circuits  that  are  entirely  DI, 
and  we  show  that  this  class  is  surprisingly  limited:  Practically  all  circuits  of 
interest  fall  outside  the  class  since  closed  circuits  inside  the  class  may  contain 
only  C-elements  as  multiple-input  operators. 

We  prove  that  all  DI  circuits  have  to  fulfill  the  so-called  Unique-Successor- 
Set  criterion;  and  we  show  that  the  class  of  circuits  that  meet  this  criterion 
is  very  limited.  We  also  give  a  characterization  of  the  class  of  computations 
that  admit  a  DI  implementation.  Finally,  we  discuss  what  we  consider  to  be 
the weakest compromise to delay-insensitivity, namely, isochronic forks.


10.2  Circuits  as  Networks  of  Gates 

A  DI  circuit  is  a  network  of  logical  operators,  or  gates.  A  gate  has  one  or 
more  Boolean  inputs  and  one  Boolean  output.  (Later,  we  will  introduce  gates 
with  multiple  outputs.)  The  state  of  the  circuit  is  entirely  characterized  by 
the  values  of  the  input  and  output  variables  of  the  gates. 

We assume that all circuits are closed: Each variable of a circuit is the
input  of  a  gate  and  also  the  output  of  a  gate.  An  open  circuit  is  transformed 
into  a  closed  one  by  representing  the  environment  of  the  circuit  as  gates. 

A  gate  with  output  variable  z  is  defined  by  the  two  production  rules: 

Bu ↦ z↑
Bd ↦ z↓

We will assume that a guard is in disjunctive-normal form, that is, it is
either a literal, a term, or a disjunction of terms. A literal is a variable or its
negation; a term is a conjunction of literals.

The  two  PRs  of  a  gate  must  fulfill  the  non-interference  requirement.  A  gate 
is  a  partial  function  when  the  non-interference  requirement  is  not  a  tautology 
but  has  to  be  maintained  as  a  program  invariant.  The  flip-flop  is  an  example 
of  such  a  gate. 
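
Whether non-interference is a tautology for a given gate can be checked by enumeration. The sketch below assumes that non-interference means the two guards of a gate never hold in the same state, and it represents guards as Python predicates over an input assignment. The function name interfering_states is hypothetical. The C-element rules are the standard ones; the flip-flop with set input s and reset input r is an illustrative assumption, not a definition taken from the text.

```python
from itertools import product

def interfering_states(up_guard, down_guard, inputs):
    """Return the input assignments in which both guards hold."""
    bad = []
    for values in product([False, True], repeat=len(inputs)):
        env = dict(zip(inputs, values))
        if up_guard(env) and down_guard(env):
            bad.append(env)
    return bad

# C-element: x and y set z; not x and not y reset z.
c_up = lambda e: e["x"] and e["y"]
c_down = lambda e: not e["x"] and not e["y"]

# Assumed flip-flop: s sets z, r resets z.
ff_up = lambda e: e["s"]
ff_down = lambda e: e["r"]

c_conflicts = interfering_states(c_up, c_down, ["x", "y"])
ff_conflicts = interfering_states(ff_up, ff_down, ["s", "r"])
```

For the C-element the list is empty, so non-interference holds in every state. For the assumed flip-flop the state with both s and r true is reported, which is exactly the state that the environment must exclude as an invariant.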

The  non-interference  requirement  eliminates  the  most  obvious  case  of  mal¬ 
functioning  of  a  gate.  But  other  forms  of  malfunctioning,  usually  called  haz¬ 
ards,  have  to  be  eliminated  as  well.  A  hazard  is  an  incomplete  transition  on 
the  output  of  a  gate  caused  either  by  two  consecutive  transitions  on  one  input 
variable  or  by  some  concurrent  changes  on  several  input  variables.  In  our 
model,  all  occurrences  of  hazards  are  eliminated  by  the  stability  requirement. 

(The  stability  of  the  physical  implementation  of  a  PR  also  requires  that 
the  changes  in  value  of  the  physical  quantity — voltage,  in  MOS  technology — 
representing  the  Boolean  values  be  monotonic.  However,  monotonicity  around 
the  stable  values  is,  in  general,  neither  attainable,  because  of  noise,  nor  nec¬ 
essary.) 

If  a  circuit  fulfills  the  non-interference  and  stability  criteria,  no  glitch  or 
hazard  can  corrupt  the  value  of  the  variables.  At  any  point  in  time,  the 
physical  quantity  representing  a  variable  either  has  one  of  the  two  stable  values 
representing  the  two  Boolean  values,  or  is  monotonically  changing  from  one 
stable  value  to  the  other. 

Any pair of PRs that set and reset the same output variable defines a
valid gate, with the exception of self-invalidating PRs. A rule with guard g
and result r is self-invalidating if r ⇒ ¬g may hold as a postcondition of a
transition of that rule. In other words, the execution of the rule may falsify
the guard. For example, the rules x ↦ x↓ and ¬x ↦ x↑ are self-invalidating.

It is always possible to modify the guard of a PR so that it does not contain
the output variable of the gate. (This is achieved by removing all terms that
contain the result as literal. For example, (x ∧ z) ∨ y ↦ z↑ can be replaced
with y ↦ z↑, since an execution of the PR in the state where x ∧ z holds is
vacuous.)

Hence, gates do not contain variables that are both input and output
(self-loops). In the sequel, unless specified otherwise, an execution of a PR is an
effective execution.
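
The removal of terms that contain the result literal is easy to mechanize. The following sketch assumes a guard in disjunctive-normal form represented as a list of terms, each term a frozenset of literals, and each literal a (variable, polarity) pair. The representation and the name strip_result_terms are illustrative assumptions.

```python
def strip_result_terms(guard, result):
    """Drop every term of a DNF guard that contains the result literal."""
    return [term for term in guard if result not in term]

# (x AND z) OR y  |->  z up ; the result literal is z (polarity True).
guard = [
    frozenset({("x", True), ("z", True)}),
    frozenset({("y", True)}),
]
simplified = strip_result_terms(guard, ("z", True))
```

Here simplified contains the single term y, which corresponds to the replacement rule given as the example in the text.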

10.2.1  Wires 

A priori, a wire with input x and output y is the gate defined by the PRs
x ↦ y↑ and ¬x ↦ y↓. But, since the composition of any gate, including
a wire, with a wire is the gate itself with one of its variables renamed, we
can add an arbitrary number of wire gates to a circuit definition without
actually changing the circuit. In order to have a unique network of gates for
each circuit, we exclude the wire from the gates; a wire is just a renaming
mechanism for variables.

So  far  all  gates  except  the  wire  have  more  inputs  than  outputs,  but  most 
circuits  have  as  many  outputs  as  inputs.  We  must  therefore  reset  the  balance 
by  introducing  at  least  one  gate  with  more  outputs  than  inputs.  This  gate  is 
the  fork. 

10.2.2  Forks  and  Multiple-Output  Gates 

A fork has one input and at least two outputs. The fork, f, with input x and
outputs y and z is defined as

x ↦ y↑, z↑
¬x ↦ y↓, z↓

where the comma means the execution of the two assignments in any order or
concurrently. The generalization to an arbitrary number of outputs is obvious.
The gate

Bu ↦ x↑
Bd ↦ x↓

composed with fork f is equivalent to the gate with outputs y and z

Bu ↦ y↑, z↑
Bd ↦ y↓, z↓

Hence,  the  fork  is  just  a  mechanism  for  replicating  the  outputs  of  a  gate  and  for 
defining  gates  with  an  arbitrary  number  of  outputs.  The  following  discussion 
is  somewhat  simplified  if  we  eliminate  the  fork  and  allow  instead  the  type  of 
multiple-output  gates  that  correspond  to  the  composition  of  a  single-output 
gate  and  a  fork.  But  gates  defined  in  this  way  have  an  important  restriction: 
The  effective  execution  of  a  PR  of  a  gate  contains  an  effective  transition  on 
each  output  of  the  gate. 

10.2.3  Summary  of  the  Model 

The  only  restriction  that  these  definitions  and  conventions  introduce  on  the 
class  of  circuits  being  considered  is  the  exclusion  of  gates  with  self-loops  and 
of  arbitration  devices.  Unlike  models  based  on  the  “fundamental  mode”  of 
operation,  several  inputs  of  a  gate  may  change  values  simultaneously  as  long 
as  the  stability  of  the  guards  of  the  PRs  is  preserved. 

Also, we do not assume that the transitions are instantaneous: A variable
value changes monotonically from the “bottom” value representing one logical
value to the “top” value representing the other logical value, and vice-versa.
Because the transition durations are finite but positive and variable, the
ordering of transitions in a circuit has to be defined with care.


10.3  Partial  Order  of  Transitions 

The  specification  of  a  sequential  circuit  defines  a  partial  order  of  actions  taken 
from  a  repertoire  of  commands.  In  order  to  assert  that  a  circuit  fulfills  a 
specification,  we  must  relate  this  partial  order  to  some  other  order  relation 
among  transitions  of  the  circuit.  The  partial  order  of  transitions  is  defined  as 
follows. 

Consider  an  effective  execution  of  a  PR  causing  the  transition  t,  and  let 
C  be  a  term  of  the  guard  such  that  C  holds  for  this  execution  of  the  PR. 

We attach to C a set, T, of transitions in the following way. Each literal
of C uniquely defines a transition: The literal x is the result of a transition
of type x↑ and the literal ¬x is the result of a transition of type x↓. (The
initialization of a variable is also considered a transition.) By definition, we
say that transition t is a successor of each transition of T. In other words, a
transition is the successor of the set of transitions that make the guard true,
including initializations.

For example, if the PR is x ∧ y ↦ z↑, we say that each transition z↑ is
the successor of a transition x↑ and of a transition y↑.

If the guard of the PR is of the form A ∨ B, the transition is the successor of
the  set  of  transitions  that  make  A  true,  or  of  the  set  of  transitions  that  make 
B  true.  Hence,  the  successor  relation  defined  is  not  unique  for  a  given  circuit. 
A  computation  is  a  particular  successor  relation  on  a  set  of  transitions,  such 
that  each  computation  corresponds  to  a  possible  execution  of  the  circuit.  The 
set  of  transitions  of  a  computation  is  finite  if  the  corresponding  execution  of 
the  circuit  terminates,  and  possibly  infinite  otherwise. 

From the successor relation, we can now construct a relation ≺ that is
a pre-order; that is, it is transitive and anti-reflexive. Once we have the
pre-order relation ≺, we construct the partial order ⪯ by defining t1 ⪯ t2 to mean
t1 ≺ t2 or t1 = t2.

Transitivity. For any two transitions t1 and t2, we say that t1 ≺ t2 when
t2 is a successor of t1, or there exists a transition t3 such that t1 ≺ t3 and
t3 ≺ t2.

Anti-reflexivity. t ≺ t holds for no transition t.

remark: Anti-reflexivity is satisfied if, for each ring of gates in the circuit,
there is always at least one PR whose guard is true and whose result is false
(the ring “oscillates”). Anti-reflexivity excludes rings of gates that are used to
maintain constant values of variables, as in cross-coupled device constructions
of storage elements. We therefore assume that the storage elements are parts
of “perfect wires,” so to speak, that keep the value of a variable until the next
transition on the variable. □

Definition. A chain from a to b is a finite, non-empty set {t_i, 0 ≤ i ≤ n} of
transitions such that t_0 = a, t_n = b, and for all i, 0 < i ≤ n, t_i is a successor
of t_{i-1}. By construction, a ⪯ b means that there is a chain from a to b. If
a ≺ b, we say that b follows a.
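
The construction of the order relation from the successor relation can be sketched as a transitive closure. The code below assumes that a computation is given as a dictionary mapping each transition (here just a string such as "x+" for a rising transition on x) to the set of its direct successors. The names precedes and is_anti_reflexive are hypothetical.

```python
def precedes(successors):
    """Transitive closure of a successor relation.

    Returns a dict mapping each transition a to the set of all
    transitions b that follow a (there is a chain of length >= 1).
    """
    nodes = set(successors)
    for targets in successors.values():
        nodes |= set(targets)
    closure = {}
    for start in nodes:
        seen, stack = set(), list(successors.get(start, ()))
        while stack:
            t = stack.pop()
            if t not in seen:
                seen.add(t)
                stack.extend(successors.get(t, ()))
        closure[start] = seen
    return closure

def is_anti_reflexive(closure):
    """True when no transition follows itself."""
    return all(t not in later for t, later in closure.items())

# z+ is a successor of x+ and y+; x- is a successor of z+.
computation = {"x+": {"z+"}, "y+": {"z+"}, "z+": {"x-"}}
order = precedes(computation)
acyclic = is_anti_reflexive(order)
cyclic = is_anti_reflexive(precedes({"a": {"b"}, "b": {"a"}}))
```

In the example, x- follows x+ through the chain x+, z+, x-, and the relation is anti-reflexive. The two-element ring at the end is rejected because each transition would follow itself.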


10.4  Implementation  of  Stability 

Consider again an execution of a PR, with guard B and transition t. Either B
is never falsified once it holds, but then t is the last transition on the variable
involved, and we say that the transition is final. Or B is falsified after a finite
number of transitions following t, in which case, in order to implement the
stability of B, we have to see to it that t is completed before B is falsified.

For all transitions i that falsify B, we have to guarantee t ≺ i. Hence, by
definition of the order relation, there must be a transition s such that s is a
successor of t, and s ⪯ i. We say that s acknowledges t. Hence, the

Acknowledgment Theorem. In a DI circuit, each non-final transition has
a successor transition.

By  construction  of  multiple-output  gates,  we  have  the 

Corollary.  In  a  DI  circuit,  a  non-final  transition  on  an  input  of  a  gate  has  a 
successor  transition  on  each  output  of  the  gate. 

EXAMPLE: Consider the three following gates with two inputs, x and y,
and one output, z. The flip-flop is defined as x ↦ z↑ and ¬y ↦ z↓, the
asymmetric C-element as x ∧ y ↦ z↑ and ¬y ↦ z↓, and the switch as
x ∧ y ↦ z↑ and x ∧ ¬y ↦ z↓.

Since no guard of these gates has a term containing the literal ¬x, a transition
of type x↓ has no successor. Hence, according to the Acknowledgment
Theorem, there can be at most two transitions on x in any computation of a
DI circuit using any of these three gates. □
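
The argument of the example amounts to a syntactic check on the guards: a transition type can have a successor only if the corresponding literal occurs in some guard of the gate. The sketch below applies this check to the asymmetric C-element and the switch. The data representation (each gate as a list of DNF guards, each term a set of (variable, polarity) pairs) and the name unacknowledged are assumptions made for illustration.

```python
def unacknowledged(gates, variable):
    """For each gate, list the transition types on `variable`
    whose literal appears in no guard of the gate."""
    missing = {}
    for name, rules in gates.items():
        literals = {lit for guard in rules for term in guard for lit in term}
        types = []
        if (variable, True) not in literals:
            types.append(variable + "+")   # rising transition
        if (variable, False) not in literals:
            types.append(variable + "-")   # falling transition
        missing[name] = types
    return missing

gates = {
    # x AND y |-> z up ; NOT y |-> z down
    "asymmetric C-element": [
        [{("x", True), ("y", True)}],
        [{("y", False)}],
    ],
    # x AND y |-> z up ; x AND NOT y |-> z down
    "switch": [
        [{("x", True), ("y", True)}],
        [{("x", True), ("y", False)}],
    ],
}
on_x = unacknowledged(gates, "x")
on_y = unacknowledged(gates, "y")
```

For both gates the falling transition on x is reported, while every transition on y can be acknowledged.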

10.5  The  Unique-Successor-Set  Criterion 

Later on, we shall give a simple criterion for deciding whether a given circuit
(a network of gates) is DI. But such a criterion does not tell us whether there
exists a DI circuit for a given specification. We shall therefore formulate
a more general theorem that characterizes the partial orders of transitions
that admit a DI implementation. This criterion enables us to decide that a
program has no DI implementation without having to construct a circuit.

Successor  Set.  In  a  computation,  the  successor  set  of  a  transition  t  is  the 
set  of  variables  x  such  that  a  transition  on  x  is  a  successor  of  t. 

Unique-Successor-Set  Property.  A  computation  has  the  unique-successor- 
set  (USS)  property  when  all  non-final  transitions  on  the  same  variable  have 
the  same  successor  set.  A  set  of  computations  has  the  USS  property  when  all 
non-final  transitions  on  the  same  variable  have  the  same  successor  set  in  all 
computations  of  the  set. 

Unique-Successor-Set  Theorem.  A  set  of  computations  of  a  DI  circuit 
has  the  USS  property. 

Proof. Consider an arbitrary variable x of a DI circuit. By the corollary of
the  Acknowledgment  Theorem,  any  non-final  transition  t  on  x  has  a  successor 
transition  on  each  output  of  the  gate,  say  G,  of  which  x  is  an  input. 

By  definition  of  the  successor  set,  the  set  of  output  variables  of  G  is  the 
successor  set  of  t.  But  since  the  set  of  output  variables  of  a  gate  is  unique, 
the  successor  set  is  the  same  for  all  non-final  transitions  on  x.  [] 
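
The USS property can be checked mechanically once a computation is written down explicitly. The following is a minimal Python sketch, not part of the original method. It assumes that a computation is given as a dictionary mapping each transition, written as a (variable, occurrence) pair, to the set of its successor transitions, and it treats a transition with no successor as final. The name has_uss and this encoding are illustrative assumptions.

```python
def has_uss(successors):
    """Return True if all non-final transitions on a variable share one successor set."""
    seen = {}  # variable -> successor set recorded for its first non-final transition
    for (var, _), nexts in successors.items():
        if not nexts:  # assumed meaning of "final": the transition has no successor
            continue
        succ_set = frozenset(v for v, _ in nexts)
        if seen.setdefault(var, succ_set) != succ_set:
            return False
    return True


# a1 -> b1 -> a2 -> b2: every non-final transition on a is followed by one on b.
chain_ok = {("a", 1): {("b", 1)}, ("b", 1): {("a", 2)},
            ("a", 2): {("b", 2)}, ("b", 2): set()}
# a1 -> b1 -> a2 -> c1: the two transitions on a have successor sets {b} and {c}.
chain_bad = {("a", 1): {("b", 1)}, ("b", 1): {("a", 2)},
             ("a", 2): {("c", 1)}, ("c", 1): set()}

print(has_uss(chain_ok), has_uss(chain_bad))
```

For a set of computations, the same check would be run with one shared table of recorded successor sets across all computations of the set.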

10.6  Characterization  of  DI  computations 

Although  the  Unique-Successor-Set  Theorem  is  a  direct  consequence  of  the 
Acknowledgment  Theorem,  its  formulation  in  terms  of  computations  instead 
of  gates  makes  it  possible  to  lift  the  result  from  the  implementation  level  to 
the specification level. Since the partial orders of actions defining a circuit are
projections  of  the  partial  orders  of  actions  implementing  it,  we  shall  investigate 
whether  the  USS  property  is  maintained  by  projection. 

Definition. Given a computation, c, on a set of variables, V, the projection
of c on a subset, W, of V is the computation derived from c by removing all
transitions on variables of V\W from the chains of c. The projection of a set
of computations is the set obtained by projecting each element of the original
set.

Projection  Theorem.  If  a  set  of  computations  has  the  USS  property,  then 
its  projection  on  a  subset  of  variables  has  the  USS  property. 

Proof. By definition, the projection of a set of computations on W can be
obtained by removing the elements of V\W one by one from all chains of each
computation of the set. We prove the theorem by showing that removing all
transitions on one variable, say, w, maintains the USS property of the set.

Let x be another variable, and let X be the USS of (all transitions on)
x in all computations of the set. Either w does not belong to X and X is
left unchanged by the transformation, or w is removed from X. But then, for
each transition tx on x, the successor set of the transition on w that follows
tx must be added to the successor set of tx. Since all transitions on w have
the same successor set in all computations of the set, the new X is the same
for all transitions and all computations of the set. □
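
The step used in this proof can be illustrated on a single chain. The sketch below is only an illustration under simplifying assumptions: a computation is represented as a list of chains, each chain is a list of (variable, direction) pairs, and the helper names project and successor_vars are hypothetical. Removing the variable w from the chain replaces w in the successor set of x by the successor set of w.

```python
def project(chains, keep):
    """Remove every transition on a variable outside `keep` from each chain."""
    return [[t for t in chain if t[0] in keep] for chain in chains]


def successor_vars(chain):
    """For one chain, map each variable to the variables that immediately follow it."""
    result = {}
    for (var, _), (nxt, _) in zip(chain, chain[1:]):
        result.setdefault(var, set()).add(nxt)
    return result


chain = [("x", "up"), ("w", "up"), ("y", "up"),
         ("x", "down"), ("w", "down"), ("y", "down")]
projected = project([chain], {"x", "y"})[0]

print(successor_vars(chain)["x"])      # successor set of x before removing w
print(successor_vars(projected)["x"])  # successor set of x after removing w
```

In this example the successor set of x is {w} before the projection and {y} afterwards, and it is the same for both transitions on x in each case.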

10.6.1  Example:  One-Place  Buffer 

The  cyclic  program  *[X;  Y],  where  X  and  Y  are  communication  commands,  is 
called a one-place buffer¹. It is a basic building block of asynchronous circuit
design since it is used to implement the sequencing of any two actions. With
a four-phase handshaking protocol for implementing the communications, an
expansion of the program in terms of elementary variables is:

*[[xi]; xo↑; [¬xi]; xo↓; yo↑; [yi]; yo↓; [¬yi]],

where xi and yi are the input variables, and xo and yo are the output
variables². (See Figure 10.1.) The environment of the circuit can be simply
modeled as the two programs:

*[xi↑; [xo]; xi↓; [¬xo]]

*[[yo]; yi↑; [¬yo]; yi↓].

These  three  programs  are  concurrent.  Now  observe  that  the  projection  of  a 
computation on the output variables of the first program gives the computation
described by the program

*[xo↑; xo↓; yo↑; yo↓].

Obviously,  this  computation  does  not  have  the  USS  property;  therefore,  by 
the  Projection  Theorem,  the  closed  circuit  implementing  the  three  programs 
is  not  DI.  But  the  two  environment  programs  can  be  implemented  with  an 
inverter  gate  and  an  identity  gate,  which  are  DI  circuits.  Hence,  there  is  no 
DI  circuit  implementing  this  version  of  the  one-place  buffer  with  four-phase 
handshaking. 

We can state a more general result. We observe that, whatever four-phase
handshaking is chosen for X and Y, the projection on the output variables
is always *[xo↑; xo↓; yo↑; yo↓], unless the handshaking actions of X are
reordered (“shuffled”) with respect to the handshaking actions of Y. Hence,
we have the following theorem.

¹ The notation *[S] stands for the non-terminating repetition of the program S.

² For an arbitrary Boolean expression B, the command [B] is a shorthand notation for
[B → skip], and can be informally defined as “wait until B holds.”

Figure 10.1: A one-place buffer and its interface


Theorem. There is no DI circuit implementing a one-place buffer with
unshuffled four-phase handshaking.

We can shuffle the handshaking actions of X with respect to the handshaking
actions of Y, so that the projection on the output variables is the
sequence

*[xo↑; yo↑; xo↓; yo↓].

Now, the sequence has the USS property, and we can implement the one-place
buffer as a DI circuit. An example is shown in Figure 10.2.
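
The two output sequences of this example can be compared with a few lines of Python. This is a hedged sketch, not the author's procedure: it assumes that a cyclic sequence is written as a list of (variable, direction) pairs, that the successor of the last transition is the first transition of the next iteration, and that the successor set of a transition in a total order is the singleton holding the variable of the next transition. The function names are hypothetical.

```python
def successor_sets_cyclic(seq):
    """Map each variable to the distinct successor sets seen in the cyclic sequence."""
    result = {}
    n = len(seq)
    for i, (var, _) in enumerate(seq):
        next_var = seq[(i + 1) % n][0]  # wrap around: the sequence repeats forever
        result.setdefault(var, set()).add(frozenset({next_var}))
    return result


def has_uss_cyclic(seq):
    """True if every variable has exactly one successor set in the cyclic sequence."""
    return all(len(sets) == 1 for sets in successor_sets_cyclic(seq).values())


unshuffled = [("xo", "up"), ("xo", "down"), ("yo", "up"), ("yo", "down")]
shuffled = [("xo", "up"), ("yo", "up"), ("xo", "down"), ("yo", "down")]

print(has_uss_cyclic(unshuffled), has_uss_cyclic(shuffled))
```

In the unshuffled sequence, xo↑ is followed by a transition on xo and xo↓ by a transition on yo, so the two transitions on xo have different successor sets. In the shuffled sequence every transition on xo is followed by one on yo and conversely.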


10.7  Specifications  and  the  USS  Property 

The  Projection  Theorem  is  very  useful  because  we  can  also  define  when  a 
specification has the USS property. If a specification does not have the property,
we can immediately conclude that there exists no DI implementation of
the  specification.  The  projection  from  implementation  to  specification  occurs 
as  follows. 

Figure 10.2: A DI circuit for the one-place buffer

We assume that, whatever specification notation is used, whether programs,
traces, or regular expressions, it is possible to derive from the specification
certain properties of the partial order of actions involved. Hence, in
the sequel, a specification is a set of partial orders of actions, where an action
is an execution of a command taken from some given repertoire.

We  also  assume  that  an  elementary  variable  can  be  uniquely  identified  with 
(the  implementation  of)  each  command:  The  transitions  on  the  variable  occur 
only  in  the  executions  of  the  command,  and  each  execution  of  the  command 
contains  a  transition  on  the  variable.  This  (in  theory,  slightly  restrictive) 
assumption  is  needed  only  for  the  following 

Specification  Theorem.  If  the  specification  of  a  circuit  does  not  have  the 
USS  property,  the  circuit  is  not  DI. 

Proof. Consider a specification, S, of a circuit. For each command, X, of
S, we substitute a transition on the elementary variable x that is uniquely
associated with X. We obtain a set, s, of partial orders of transitions on
elementary variables. Since the existence of the USS property is independent
of whether the transitions are upgoing or downgoing (that is, the “direction”
of the transitions), we can decide whether s has the USS property even though
the direction of the transitions in s is undefined.

By definition, we say that specification S has the USS property if and
only if the set, s, thus defined has the USS property. By construction, s is a
projection of the set of computations of the circuit specified by S. Hence, by
the Projection Theorem and the USS Theorem, if s does not have the USS
property, the circuit is not DI. □

Examples. The following examples, which we give without proofs, show
how limited is the class of programs that admit a DI implementation. (In
the examples, all commands are different from skip.) We assume that the
semantics of the program notation are clear enough that we can identify the
programs with the partial order of actions they represent.

• Let P = *[S1; S2; ...; Sn], and assume that there is no equivalent program

*[S1; S2; ...; Sk]

with k < n. (We say that P is a minimal representation. For instance, *[X; X]
is not minimal since *[X] is an equivalent program.)

Then P has the USS property if and only if Si ≠ Sj for i ≠ j. Hence, the
“modulo-2 counter” *[X; X; Y] and all other “modulo-k counters” have no DI
implementation. A similar result has been proved by C. J. Seger [22].

• The program *[[B1 → S1; S2 [] B2 → S1; S3]], with S2 ≠ S3, does not have
the USS property. Hence, there is no DI circuit implementing such a selection
command. □
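
The first example can be checked by brute force on small programs. The sketch below is an illustration only and rests on two assumptions that are not stated in the text: a cyclic program *[S1; ...; Sn] is identified with the tuple of its command names, and it is taken to be non-minimal exactly when the tuple is a shorter tuple repeated a whole number of times. The function names are hypothetical.

```python
from itertools import product


def is_minimal(seq):
    """True if seq is not a shorter block repeated a whole number of times."""
    n = len(seq)
    return not any(n % k == 0 and seq == seq[:k] * (n // k) for k in range(1, n))


def has_uss(seq):
    """True if each command is always followed by the same command (cyclically)."""
    following = {}
    n = len(seq)
    for i, cmd in enumerate(seq):
        nxt = seq[(i + 1) % n]
        if following.setdefault(cmd, nxt) != nxt:
            return False
    return True


def claim_holds(alphabet="XYZ", max_len=6):
    """Check: a minimal program has the USS property iff its commands are all distinct."""
    for n in range(1, max_len + 1):
        for seq in product(alphabet, repeat=n):
            if is_minimal(seq) and has_uss(seq) != (len(set(seq)) == n):
                return False
    return True


print(claim_holds())
print(has_uss(("X", "X", "Y")))  # the "modulo-2 counter" *[X; X; Y]
```

Such an enumeration is of course no proof, but it makes the role of minimality visible: *[X; Y; X; Y] passes the USS check only because it is equivalent to *[X; Y].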

10.8  Gate  Characterization  of  DI  Circuits 

We  have  already  seen  that,  apart  from  the  trivial  case  where  one  input  of 
the  gates  changes  at  most  twice,  there  is  no  DI  circuit  that  contains  either  a 
flip-flop,  or  an  asymmetric  C-element,  or  a  switch.  In  the  same  way,  we  can 
use  the  USS  and  the  Projection  Theorems  to  show  that  there  is  no  DI  circuit 
containing  either  an  or-gate,  or  an  and-gate,  or  an  exclusive-or,  in  which  each 
input  of  the  gates  changes  more  than  a  minimum  number  of  times  specific  to 
each  case.  Consider  an  or-gate  with  inputs  x  and  y  and  output  z.  The  only 
sequence³ in which each transition on an input is acknowledged is:

((x↑; z↑; x↓; z↓)*; (y↑; z↑; y↓; z↓)*)*

We  easily  see  that  any  computation  that  contains  a  transition  on  both  inputs 
does  not  have  the  USS  property. 

The  cases  of  the  and-gate  and  of  the  exclusive-or  are  treated  similarly  and 
are  left  as  an  exercise  for  the  reader.  After  having  eliminated  all  gates  with  at 
most  two  inputs  except  the  inverter  and  the  Muller-C  element,  we  are  led  to 
conjecture  that  a  DI  circuit  contains  only  C-elements.  C-elements  are  defined 
as  follows. 

³ The notation (S)* is the Kleene-star notation standing for an arbitrary number of
actions S in sequence.

Definition. An n-input gate in which Bu is the conjunction of the n input
variables and Bd is the conjunction of the negations of the n input variables
is called an n-input C-element. A gate derived from a C-element by negating
one or more literals in Bu or Bd is also a C-element.


The  Muller-C  element  is  a  two-input  C-element  according  to  our  definition. 
A  one-input  C-element  reduces  to  either  a  wire  or  an  inverter. 
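
As a behavioral illustration of this definition, the next-state function of an n-input C-element without negated literals can be sketched in Python as follows. The function name c_element and the list-of-Booleans encoding are assumptions made for this sketch; it models only the logical behavior, not delays.

```python
def c_element(inputs, z):
    """Next output of an n-input C-element, given its inputs and current output z."""
    if all(inputs):      # guard Bu: conjunction of the n input variables
        return True
    if not any(inputs):  # guard Bd: conjunction of the negations of the inputs
        return False
    return z             # neither guard holds: the output keeps its value


# Two-input (Muller) C-element: the output changes only when the inputs agree.
print(c_element([True, True], False))   # both inputs high: output goes high
print(c_element([True, False], True))   # inputs disagree: output is held
# One-input C-element: the output simply follows the input, like a wire.
print(c_element([False], True))
```

Negating the single input of the one-input version before the call would give the inverter mentioned above.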

C-Element Theorem. If a DI circuit has only one computation, and if the
computation contains at least three transitions on each variable, then the
circuit can be constructed with C-elements only.

Proof.  Let  x  be  an  arbitrary  variable  of  the  circuit;  x  is  the  input  of  gate 
g  with  output  z.  We  shall  prove  that  g  can  be  implemented  as  a  C-element. 
Since  there  are  no  self-loops,  x  and  z  are  different  variables. 

First,  observe  that  because  of  the  non-interference,  all  transitions  on  the 
same  variable  are  totally  ordered.  And  because  all  transitions  are  effective, 
upgoing  and  downgoing  transitions  on  the  same  variable  alternate. 

Since the computation contains at least three (effective) transitions on each
variable, at least one transition of type x↑ is followed by a transition of type x↓,
and at least one transition of type x↓ is followed by a transition of type x↑.

Let t1 be a transition of type x↑ and t2 be the transition of type x↓
following it. For the guard of the PR of t1 to be stable, there must be a
transition tz on z such that t1 ≺ tz ≺ t2. We also know that tz is a successor
of t1.

By the USS Theorem and the Projection Theorem, there is exactly one
transition tz on z such that t1 ≺ tz ≺ t2. By the same argument, there is
exactly one transition on z between a transition of type x↓ and the transition
of type x↑ following it.

Without loss of generality, assume that the first transition on x is of type
x↑ and the first transition on z is of type z↑. Then, because of the alternation
of upgoing and downgoing transitions on each variable, each transition of type
z↑ is the successor of a transition of type x↑, and each transition of type z↓
is the successor of a transition of type x↓.

By definition of the successor relation, x holds as a precondition of each
transition z↑; thus, guard Bu of g can be formulated so that all terms contain
x, since a term that is never true can be removed. Hence, Bu can be chosen
of the form x ∧ Cu, where Cu does not contain x. Symmetrically, guard Bd
of g can be chosen of the form ¬x ∧ Cd, where Cd does not contain x. Since
this property of Bu and Bd holds for each input of g, g is a C-element or can
be replaced with a C-element. □

10.9 Isochronic Forks

Since  the  class  of  DI  circuits  is  so  limited,  we  must  have  compromised  the 
delay-insensitivity  in  the  circuits  that  we  designed  using  the  synthesis  method 
described, for instance, in [12] and [11]. Let us analyze a standard sequencing
circuit used in this design style. (It is similar to the one-place buffer, but is
simpler to use as an example.) This circuit (Figure 10.3) is an implementation
of the sequence of elementary actions:

*[[xi]; yo↑; [yi]; u↑; [u]; yo↓; [¬yi]; xo↑; [¬xi]; u↓; [¬u]; xo↓].


Figure  10.3:  A  sequencing  element  containing  isochronic  forks 

The environment of the circuit is the same as that of the one-place buffer.
The x- and y-variables are each parts of a four-phase handshaking sequence,
and u is a state variable; without u, it would not be possible to encode each
state of the circuit uniquely. Since the projection of this sequence on the
variables xo, yo, and u lacks the USS property, and since the environment of
the circuit can be implemented as an inverter and an identity, the circuit is
not DI.

In order to find out where we have cheated, we must look at the forks.
We observe that xi is an input both of the and-gate with output yo and of
the C-element. Hence, the circuit actually contains a fork with input xi and
two outputs, say, x1 and x2. Similarly, the circuit contains a fork with input
yi, and a fork with input u. Let us analyze the behavior of the first fork by
introducing it explicitly into the set of PRs of the circuit. For the sake of
simplicity, we ignore the other two forks. We get:


xi            ↦  x1↑, x2↑
x1 ∧ ¬u       ↦  yo↑
x2 ∧ yi       ↦  u↑
¬x1 ∨ u       ↦  yo↓
¬yi ∧ u       ↦  xo↑
¬xi           ↦  x1↓, x2↓
¬x2 ∧ ¬yi     ↦  u↓
yi ∨ ¬u       ↦  xo↓


Transitions x1↑ and x2↑ are both acknowledged by the two PRs that follow.
But only transition x2↓ is acknowledged. Transition x1↓ is not acknowledged.
Hence, the circuit is not DI, because the Acknowledgment Theorem is
not satisfied. Therefore, the completion of transition x1↓ is not guaranteed
unless we implement the fork as an isochronic fork, which is defined as follows.

In  an  isochronic  fork,  when  a  transition  on  one  output  is  acknowledged, 
and  thus  completed,  the  transitions  on  all  outputs  are  acknowledged,  and  thus 
completed. 
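
One way to see what can go wrong without this assumption is to simulate the production rules of the sequencing element under two scheduling policies. The sketch below is an illustration under stated assumptions, not a construction from the text: it models the environment as an inverter (xi follows ¬xo) and a wire (yi follows yo), starts with all variables low, and fires one enabled transition at a time. When slow_x1 is set, transitions on the fork branch x1 are postponed whenever anything else can fire. The function names and the scheduling policy are hypothetical.

```python
VARS = ["xi", "x1", "x2", "yi", "yo", "u", "xo"]


def enabled(s):
    """List the effective transitions (variable, new value) whose guard holds in s."""
    rules = [
        ("x1", True, s["xi"]), ("x2", True, s["xi"]),            # fork, rising
        ("x1", False, not s["xi"]), ("x2", False, not s["xi"]),  # fork, falling
        ("yo", True, s["x1"] and not s["u"]),
        ("yo", False, (not s["x1"]) or s["u"]),
        ("u", True, s["x2"] and s["yi"]),
        ("u", False, (not s["x2"]) and not s["yi"]),
        ("xo", True, (not s["yi"]) and s["u"]),
        ("xo", False, s["yi"] or not s["u"]),
        ("xi", True, not s["xo"]), ("xi", False, s["xo"]),       # assumed environment: inverter
        ("yi", True, s["yo"]), ("yi", False, not s["yo"]),       # assumed environment: wire
    ]
    return [(v, val) for v, val, guard in rules if guard and s[v] != val]


def spurious_yo_rises(slow_x1, steps=60):
    """Count firings of yo going high while xi is low, i.e. with no request pending."""
    s = dict.fromkeys(VARS, False)
    count = 0
    for _ in range(steps):
        ready = enabled(s)
        if not ready:
            break
        if slow_x1:  # postpone the x1 branch whenever another transition can fire
            ready = [t for t in ready if t[0] != "x1"] or ready
        var, val = ready[0]
        if var == "yo" and val and not s["xi"]:
            count += 1
        s[var] = val
    return count


print(spurious_yo_rises(slow_x1=False), spurious_yo_rises(slow_x1=True))
```

Under these assumptions the first policy never raises yo outside a request, whereas the second does: once x2↓ and u↓ have fired while x1 is still high, the guard x1 ∧ ¬u holds again and yo↑ can fire a second time in the same cycle.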

(We  leave  it  as  an  exercise  to  the  reader  to  check  that  the  fork  with  input 
yi  must  also  be  isochronic,  but  not  the  fork  with  input  u.) 

The implementation of an isochronic fork relies on two types of assumptions
about delays. First, we have to assume that the difference between the
delays  in  the  branches  of  the  fork  is  negligible  compared  to  the  delays  in  the 
gates.  This  requirement  is  easy  to  meet  in  current  MOS  technology  except 
when  there  is  an  inverter  on  one  branch  of  the  fork  and  not  on  the  other 
branch(es).  The  fork  with  input  yi  has  such  an  inverter,  and  therefore,  the 
inverter  must  be  removed  by  proper  circuit  transformations. 

Second,  and  more  important  in  current  technology,  we  have  to  assume 
that  the  switching  thresholds  in  the  different  gates  to  which  the  fork  is  an 
input  are  close  enough  to  each  other.  This  requirement  is  more  difficult  to 
meet  than  the  first  one  because,  on  the  one  hand,  the  thresholds  of  individual 
transistors  are  difficult  to  control — in  particular  in  CMOS;  on  the  other  hand, 
the  switching  thresholds  of  a  gate  vary  greatly  with  the  logical  design  of  the 
gate.  For  these  reasons,  this  requirement  may  impose  a  design  style  in  which 
all  gates  are  implemented  as  combinational  gates,  so  that  the  fight  between 
pull-up  and  pull-down  during  the  switching  of  the  gate  keeps  the  switching 
threshold around VDD/2. Observe that, unlike what is advocated in other
compromises to delay-insensitivity, enforcing the locality of the wires offers
little help in implementing isochronicity because locality is irrelevant to the
issue of threshold voltages!

10.10 For Whom the Bell Tolls?

Are  these  results  tolling  the  bell  for  DI  design?  Actually,  not.  At  worst,  they 
may  slightly  embarrass  those  researchers  who  claim  to  have  a  design  method 
for  entirely  DI  circuits.  At  best,  they  vindicate  the  compromises  to  delay- 
insensitivity  adopted  by  several  asynchronous  design  methods.  Most  likely, 
they  are  sobering  reminders  of  the  difficulty  of  VLSI  design  and  the  novelty 
of  asynchronous  design. 

We  have  proved  elsewhere  that  extending  a  standard  repertoire  of  DI  gates 
with  isochronic  forks  is  sufficient  to  construct  any  circuit  of  interest.  The 
proof  consists  in  giving  a  circuit  implementation  for  each  construct  of  the 
program notation we use (see [2]). I believe the isochronic fork to be the
weakest  possible  compromise  to  delay-insensitivity  in  the  sense  that  all  other 
compromises  also  include  isochronic  forks:  For  instance,  in  speed-independent 
design[19],  all  forks  are  supposed  to  be  isochronic;  in  self-timed  design[23],  all 
forks  inside  a  certain  region — called  an  equipotential  region — are  assumed  to 
be  isochronic. 


Chapter  11 

Conclusion 


We  have  described  a  method  for  implementing  a  concurrent  program  (a  set 
of  communicating  processes)  as  a  network  of  digital  operators  that  can  be 
directly  mapped  into  a  delay-insensitive  VLSI  circuit.  The  circuit  is  derived 
from  the  program  by  applying  a  series  of  systematic,  semantics-preserving 
transformations  that  we  have  compared  to  compiling.  Hence,  the  circuits  are 
correct  by  construction,  and  their  logical  correctness  is  independent  of  the 
delays  in  operators  and  wires,  with  the  exception  of  isochronic  forks. 

The  most  encouraging  aspect  of  the  method  is  that  it  is  really  a  synthesis 
technique:  it  allows  designers  to  construct  solutions  that  they  would  never 
have  found  had  they  not  applied  the  method.  Different  applications  of  the 
transformations  lead  to  different  circuits  for  the  same  program.  Although  all 
circuits  are  semantically  equivalent,  they  may  exhibit  different  behaviors  in 
terms of speed or size (number of operators used). The method therefore
includes the trade-offs between simplicity and efficiency that should be available
to  the  VLSI  designer. 

Using concurrency to implement a sequential computation may seem wasteful
at first sight. But VLSI is essentially a concurrent medium: concurrency is
implemented at no cost by mere juxtaposition of the concurrent parts. On the
other hand, implementing sequencing requires synchronization and is, in general,
more expensive. We shall therefore implement sequencing as restricted
concurrency. Once a process has been transformed into a semantically equivalent
set, the problem of implementing sequencing has disappeared!

This  technique  entails  one  of  the  main  novelties  of  the  method.  Other 
techniques  implement  sequencing  by  transforming  the  computation  into  a 
finite-state  machine,  and  realizing  each  state  with  a  state-holding  element. 
In our technique, some state-holding elements may be needed, but the number
of those elements is drastically less than in techniques using finite-state
machines.

Since the issue of isochronic forks seems to have confused certain readers
of  previous  papers,  let  us  make  clear  a  number  of  points.  First,  most  forks 
need  not  be  isochronic,  as,  for  instance,  the  fork  that  distributes  a  control 
signal  to  all  bits  of  a  register.  Second,  the  isochronicity  requirement  is  easy  to 
meet  when  there  is  no  inverter  on  the  branches  of  the  fork,  and  in  practice,  it 
is  usually  easy  to  move  the  inverters  so  as  to  remove  them  from  the  branches 
of  isochronic  forks.  Third,  isochronic  forks  are  necessary  to  implement  the 
sequencing  of  two  four-phase  handshaking  protocols;  therefore,  methods  that 
claim  to  dispense  with  isochronic  forks  just  hide  them  inside  building  blocks. 

The proofs that the transformations preserve the semantics of the algorithms
rely on properties of the four-phase handshaking protocol with which
the communication primitives are implemented. Although rigorous proofs of
these properties have been omitted, the reader should have no difficulty in
being convinced of their correctness, and thus of the correctness of the
transformations performed.

The  examples  cover  most  constructs  of  the  language  but  not  all  of  them: 
We  have  not  shown  how  to  implement  an  arbitrary  set  of  guards.  Therefore, 
we  have  not  quite  shown  that  any  program  in  the  language  can  be  compiled. 
Such  a  proof  has  been  given  in  [1]  and  [2],  where  the  compilation  of  each 
construct  is  described  as  part  of  the  basic  algorithm  for  an  automatic  compiler. 
It  is  shown  that  any  program  in  a  subset  of  the  language  can  be  implemented 
as  a  delay-insensitive  circuit  using  only  a  small  set  of  basic  elements:  the 
2-input  C-element,  the  2-input  or-gate  or  2-input  and-gate,  the  synchronizer, 
the  inverter,  and  the  isochronic  fork. 

However,  there  is  no  reason  for  confining  the  designer  to  a  minimal  set 
of  operators.  On  the  contrary,  since  an  advantage  of  VLSI  is  the  possibility 
to  create  operators  at  no  cost,  introducing  the  special  purpose  operator  that 
exactly  implements  an  arbitrary  set  of  production  rules  often  simplifies  a 
circuit  drastically. 

In order to convince the VLSI community of the practicality of our method,
it was essential that we fabricate the circuits we had designed. Hence, all
significant examples that we have used in our research (distributed mutual
exclusion, queues, stack, routing automata for communication networks, 3x + 1
engine, microprocessor) have been fabricated in SCMOS using the MOSIS
foundry service. They have all been found to be correct on “first silicon.”
They are also very robust, and surprisingly fast, given the low level of circuit
optimization applied. The 3x + 1 engine, constructed by Tony Lee, is
a special-purpose processor consisting of a state machine and an 80-bit-wide
datapath. It contains approximately 40,000 transistors and operates at over 8
MIPS (million instructions per second) in 2 μm MOSIS SCMOS technology.

We have designed the first asynchronous general-purpose microprocessor
in CMOS. The results of this experiment, described in Chapter 9, are very
encouraging and contradict the long-held belief that asynchronous techniques are
too slow and too wasteful in area for something as demanding as a pipelined
general-purpose microprocessor. We have just finished a GaAs version of the
same microprocessor. It is now being fabricated by Vitesse through the MOSIS
service. Although it is too early to report any performance results, this
experiment already demonstrates how easy it is with such a synthesis method
to transport a complete design across very different technologies.